ABSTRACT

We introduce a model for communication costs in parallel processing environments, called the
“hyperbolic model,” which generalizes two-parameter dedicated-link models in an analytically
simple way. Many existing communication models assume dedicated interprocessor links
parameterized by a latency and a transfer rate that are independent of load; such models are
unrealistic for workstation networks. The communication system is modeled as a directed
communication graph in which terminal nodes represent the application processes that initiate the
sending and receiving of the information and in which internal nodes, called communication blocks
(CBs), reflect the layered structure of the underlying communication architecture. The direction
of graph edges specifies the flow of the information carried through messages. Each CB is
characterized by a two-parameter hyperbolic function of the message size that represents the
service time needed for processing the message. The parameters are evaluated in the limits of very
large and very small messages. Rules are given for reducing a communication graph consisting of
many CBs to an equivalent two-parameter form, while maintaining an approximation for the service
time that is exact in both the large-message and small-message limits. The model is validated on a
dedicated Ethernet network of workstations by experiments with communication subprograms arising
in scientific applications, for which a tight fit of the model predictions with actual
measurements of the communication and synchronization time between end processes is demonstrated.
The model is then used to evaluate the performance of two simple parallel scientific applications
from partial differential equations: domain decomposition and time-parallel multigrid. In an
appropriate limit, we also show the compatibility of the hyperbolic model with the recently
proposed LogP model.
1  Introduction

The goal of this paper is to introduce a uniform framework for analyzing and predicting the
communication performance of parallel algorithms in real parallel processing environments. We
include under “parallel processing environments” systems supporting computing both on
traditional dedicated tightly coupled parallel computers (usually termed “multiprocessor systems”)
and on clusters of loosely coupled workstations (usually termed “distributed systems”).
However, “multitasking,” that is, the simultaneous execution of randomly interfering parallel
jobs, is excluded.

There  are  two  basic  elements  of  a  parallel/distributed  computation:  the  end  processes  that 
send,  receive,  manipulate  and  transform  data  and  the  links  along  which  data  flow,  forming  a 
network  having  both  structural  and  dynamic  properties. 

The  issue  of  communication  is  only  recently  beginning  to  receive  attention  in  keeping  with 
its  importance  in  models  of  parallel  computation.  Most  parallel  models  following  the  precedent 
of  [6]  start  with  the  assumption  of  “perfect”  communication,  namely  no  delay  and  unlimited 
bandwidth.  Algorithms  based  on  such  models  may  appear  to  be  highly  performant,  but  more 
realistic assumptions [4] about the underlying communication system reveal significant
degradation of their behavior.

In  designing  and  analyzing  parallel  algorithms,  either  we  have  to  make  assumptions  about 
the  properties  of  the  software/hardware  links  over  which  messages  are  exchanged  or  these 
properties  are  implicit  in  the  computational  model  used.  The  assumptions  relate  to  the  message 
reliability  and  the  responsiveness  of  the  communication  network,  the  following  being  the  most 
common: 

A1  Messages exchanged between end processes are not corrupted.

A2  No  duplicates  of  transmitted  messages  are  generated. 

A3  Between  any  pair  of  end  processes,  messages  are  received  in  the  order  they  were  sent. 

A4  The  delay  is  bounded,  that  is,  it  is  guaranteed  that  a  sent  message  will  be  delivered  to  the 
destination  end  process  within  a  certain  fixed  time. 

The overhead of enforcing these assumptions is often not taken into account. Instant
communication (implying a communication delay equal to zero) is assumed. A common idealization
is to assume that an unlimited number of processors can use unlimited bandwidth.

Besides  these  considerations  about  the  theoretical  approaches  to  parallel  computing,  our 
approach  is  motivated  by  factors  showing  the  increasing  importance  of  communication  in  the 
area  of  parallel/distributed  computing: 

•  The  need  for  improving  evaluation  of  complexity  and  efficiency  of  parallel  algorithms. 

•  Technological  trends.  The  increasing  performance  and  memory  capacity  of  the  processing 
nodes  in  parallel  computers  and  in  workstation  clusters  [8]  place  heavier  demands  on  the 
communication  between  nodes.  It  is  unrealistic  to  assume  that  communication  is  bounded 
as  more  data  are  stored  and  processed  on  each  node.  On  the  other  hand,  technological 
advances  in  the  communication  and  network  interface  technologies  come  at  a  slower  pace 
than  those  in  (micro)processor  performance  and  increased  memory  capacity.  This  has 
had and will continue to have the effect of making communication overhead the main
bottleneck  for  the  overall  performance  of  parallel  algorithms. 

•  The  revival  of  distributed  computing.  There  is  an  economically  driven  shift  toward 
using  existing  clusters  of  workstations  in  high  performance  distributed  computing,  as 
an alternative to dedicated parallel computers. Over sixty publicly available systems
for  workstation  collaboration  are  annotated  in  [18].  The  communication  links  and  the 
communication  software  being  embedded  in  general  purpose  operating  systems  running 
on  the  processing  nodes  have  distinct  features  that  must  be  considered. 

This paper introduces a new communication model for the evaluation of end-to-end
communication costs in parallel processing environments. The computational tasks are accomplished
by end processes that communicate using message passing. Messages are passed through
communication blocks, whose parameters characterize the overall hardware and software links.
The communication network itself is a communication block whose overall parameters are
presumably unknown, but derivable for a given message pattern. We give rules for reducing a
communication topology to a single equivalent communication block in situations commonly
encountered in real systems: passing messages from the same source over multiple communication
blocks, processing incoming messages from the same source in parallel by distinct processors
or by the same processor, and concurrent access to a single communication block by different
message sources.

Although  the  model  is  expressed  in  terms  of  message-passing  primitives,  it  has  applicability 
to  other  communication  paradigms  commonly  used  in  parallel  programming.  For  example,  the 
shared  memory  model  of  communication  can  be  expressed  in  terms  of  a  message  passing  model 
through  the  communication  primitives  send  and  receive. 

Assumptions A1 through A4 above are related to the reliability of the communication network.
It is the responsibility of the underlying layers of communication protocols (software links)
to ensure that these assumptions always hold for any end process or any pair of end
processes participating in the computation. This is achieved in common operating systems
by  a  layered  communication  architecture.  However,  in  the  case  of  the  tightly  coupled  parallel 
computers,  where  these  properties  can  be  supported  directly  by  the  hardware  (through  a 
highly  reliable  interconnection  network  and  simple  hardware  protocols),  we  assume  that  no 
other  requirement  is  enforced  on  the  communication  links.  That  is,  we  assume  that  only  the 
minimal  requirements  of  a  reliable  communication  are  met  and  no  specific  protocols  supporting 
other  costly  facilities  are  implemented  in  addition. 

We  focus  throughout  this  paper  on  the  general  case  of  communication  within  a  cluster 
of  workstations  cooperating  in  a  distributed  computation.  Tightly  coupled  multiprocessing  is 
included, and we analyze how it compares with the LogP model [4] in the last section. However,
distributed  computing  is  more  challenging  than  multiprocessing  for  reasons  beyond  the  obvious 
higher  average  latency  and  smaller  average  bandwidth  per  node: 

1.  The individual nodes in cluster computing environments are powerful full-function
computers with a fully developed memory hierarchy, running a non-dedicated general purpose
operating system.

2.  Unlike  a  multiprocessor  system,  a  network  of  workstations  has  no  dedicated  hardware 
links  between  processing  nodes.  In  one  form  or  another,  depending  on  the  physical  and 
data link layer characteristics, contention does occur. Of course, contention might also
occur  in  multiprocessor  communication;  however,  its  extent  and  impact  can  be  limited 
by  either  the  specific  interconnection  topologies  or  by  a  careful  implementation  of  a  given 
parallel  algorithm. 

Another  important  problem  when  evaluating  the  performance  of  parallel  algorithms  in 
practice is the distribution of work among processing nodes. This is important in multitasking
environments (especially in distributed computing) where the processing nodes are not
guaranteed to be available at all times. This impacts not only the computation; communication
may also be affected by the presence of other processes contending for the use of the
communication links. However, it is not the goal of the present work to address the problem of
fluctuating  loads  and  its  possible  effects  on  the  performance  of  parallel  algorithms  either  from 
the  computation  or  from  the  communication  point  of  view.  We  assume  throughout  the  paper 
that  a  single  end  process  is  running  at  a  given  processing  node.  An  interesting  report  exploring 
the  opposite  extreme  (interfering  end  processes,  but  free  communication)  is  [13]. 

This  paper  is  organized  as  follows: 

Section  2  formally  defines  the  hyperbolic  model  and  an  algebra  of  four  rules  for  reducing  a 
communication  graph  to  a  single  communication  block.  Each  of  the  rules  is  illustrated  with  a 
simple  example. 

Section 3 describes communication patterns generated by operations common in parallel
algorithms (broadcast, global operations, synchronization and nearest neighbor communication)
and describes how these can be evaluated using the model. Results obtained in experiments
with these operations in a distributed computing environment are presented as a first step in
validating the model. Predictions and experiments disagree by at most 15% over a range of
message sizes from one byte to 64 Kbytes, on up to 16 dedicated workstations connected by
Ethernet.

Section  4  presents  a  method  for  determining  the  parameters  of  the  communication  blocks 
when  modeling  communication  in  a  cluster  of  workstations. 

Section 5 describes two model parallel scientific applications used to test our model, giving
rise to various communication patterns as well as a wide range of message size distributions
and communication to computation ratios. The experimental results agree with the model's
predictions of the cost of communication between end processes, to within the
limits of control of interprocessor synchronization.

Section  6  describes  the  LogP  model  of  computation  for  massively  parallel  processors  [4] 
and  shows  favorable  comparison  of  the  hyperbolic  model  with  it  in  the  small  message  regime, 
for  which  the  comparison  is  easily  made. 

2  The  Hyperbolic  Communication  Model 

Given  a  set  of  source  nodes  S,  a  set  of  destination  nodes  D,  and  a  set  of  messages  M  in  a 
parallel  processing  environment  such  that: 

1.  every  message  in  M  is  sent  by  a  node  in  S  to  a  node  in  D; 

2.  every node in S sends at least one message and all messages it sends are in M;

3.  every  node  in  D  receives  at  least  one  message  and  all  messages  it  receives  are  in  M; 
our goal is to estimate for every message in M:

•  the  time  interval  between  the  sending  of  the  message  and  its  delivery  to  the  destination; 

•  the  time  interval  required  by  the  source  node  to  send  it; 

•  the  time  interval  required  by  the  destination  node  to  receive  it. 

The  sets  D,  S,  M  determine  the  state  of  the  communication  system  which  is  represented 
as  a  directed  graph  called  a  communication  graph  (CG  for  short).  A  CG  has  two  types  of 
nodes:  terminal  nodes  and  internal  nodes.  The  terminal  nodes  represent  the  end  processes 
that  ultimately  initiate  the  sending  (source  node)  and  receiving  (destination  node)  of  the  data, 
while  the  internal  nodes  embed  all  the  functions  performed  in  software  and  hardware  by  the 
communication  protocols  in  order  to  deliver  data  from  source  to  the  destination.  The  direction 
of  graph  edges  specifies  the  data  flow. 

Between  any  two  terminal  nodes  the  data  is  passed  in  byte  streams  of  any  size  called 
messages.  A  message  is  generated  by  only  one  source  node  and  delivered  to  only  one  destination 
node. At the source node a message m is represented by a pair m(x, dest), where x is the
message  size  and  dest  is  the  destination  node  to  which  the  message  is  to  be  delivered.  At 
lower  communication  levels  a  message  is  usually  split  in  smaller  data  units  of  limited  size, 
called  packets  (if  the  message  is  small  enough  to  fit  in  a  packet  then,  obviously,  it  is  not  split). 
Associated  with  each  edge  is  a  list  of  the  messages  sent  along  that  edge. 

An internal node or Communication Block (CB for short) is an abstract module that
performs the communication protocol functions. Among these functions are: splitting of messages
in  packets  for  passing  to  another  CB,  recovering  lost  or  corrupted  packets,  and  routing  the 
packets  in  the  network. 

We  say  that  two  or  more  CBs  are  dependent  iff  only  one  of  them  can  process  data  at  a 
moment  in  time  and  independent  iff  at  any  moment  in  time  all  CBs  can  process  different 
streams  of  data  without  interfering.  For  example,  two  CBs  running  on  different  processors  are 
independent,  while  if  they  run  on  the  same  processor  they  are  dependent. 

The  most  important  parameter  characterizing  a  CB  is  the  time  required  to  process  a 
message  of  size  x,  called  the  total  service  time.  As  with  any  realistic  model,  we  consider  that 
the  packet  processing  time  has  two  components  [15]: 

•  a  fixed  service  time  that  is  independent  of  the  packet  size, 

•  an  incremental  service  time  that  is  proportional  to  the  packet  size. 

The  fixed  service  time  appears  at  almost  every  layer  of  the  communication  architecture  and 
includes  [3,  14]:  the  overhead  associated  with  memory  management,  interrupt  processing  and 
context  switching,  and  the  propagation  delay  of  a  packet  on  the  communication  network. 

The  incremental  service  time  is  mainly  due  to  [3,  14,  15]:  data  movement  between  different 
protocol layers, building CRC (or checksum) when the packet is sent and verifying it when the
packet  is  received,  transmission  of  the  packet  on  the  communication  network. 

As  an  example,  consider  a  distributed  application  consisting  of  three  processes  running  on 
different  workstations  connected  by  a  communication  network.  Assume  that  each  workstation 
has a general purpose processor that runs the operating system and user applications, and
a  special  I/O  processor  that  sends/receives  data  to/from  the  communication  network  (the 
network  adapter).  The  corresponding  CG  of  this  system  is  presented  in  Figure  1,  where 
terminal nodes are represented by circles and CBs are represented by boxes. Each workstation
is represented by two CBs, one that runs on the main processor (boxes labeled $CB_{11}$, $CB_{21}$,
$CB_{31}$) and represents the communication protocol functions performed by the operating system
and the application (e.g., network layer and the upper layers of ISO/OSI standard), and the
other that represents communication protocol functions performed by the network adapter
($CB_{12}$, $CB_{22}$, $CB_{32}$). We also represent the communication network as a communication
block, labeled $CB_c$, for which the fixed service time is the delay introduced by the communication
network and the incremental service time is the average time required to send one byte of data
(called transmission time). The incremental service time includes the overhead generated by
the protocol layers to ensure the reliability (e.g., acknowledgment packets).


Figure  1:  Communication  graph  (CG)  for  three  processes  running  on  different  workstations. 
Each terminal node (process) contains the list of messages it sends (node 1 sends $m_1$ to 2 and
$m_1'$ to 3, node 2 sends $m_2$ to 3 and node 3 sends $m_3$ to 2). Each edge is labeled with the list of
messages  that  flow  in  that  direction. 
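
The communication graph of Figure 1 can be written down as plain data. The sketch below is illustrative only: the figure is not reproduced here, so the routes (main-processor block, adapter block, network block, then the reverse order on the receiving side) are an assumption drawn from the description above. The names `Message`, `route`, `edge_labels`, the terminal labels `P1` to `P3`, and the message sizes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str    # label used in the Figure 1 caption, e.g. "m1"
    size: int    # message size x in bytes (hypothetical values below)
    source: int  # sending terminal node
    dest: int    # receiving terminal node


def route(msg):
    """Assumed path: source main-processor CB, source adapter CB, network CB,
    then the destination adapter CB and destination main-processor CB."""
    s, d = msg.source, msg.dest
    return [f"P{s}", f"CB{s}1", f"CB{s}2", "CBc", f"CB{d}2", f"CB{d}1", f"P{d}"]


def edge_labels(messages):
    """Map each directed edge (u, v) to the list of messages flowing along it."""
    labels = defaultdict(list)
    for msg in messages:
        path = route(msg)
        for u, v in zip(path, path[1:]):
            labels[(u, v)].append(msg.name)
    return dict(labels)


# Message pattern from the Figure 1 caption; the sizes are made up.
messages = [
    Message("m1", 1024, 1, 2),
    Message("m1'", 2048, 1, 3),
    Message("m2", 512, 2, 3),
    Message("m3", 4096, 3, 2),
]
labels = edge_labels(messages)
print(labels[("CB12", "CBc")])  # messages leaving workstation 1
print(labels[("CBc", "CB22")])  # messages arriving at workstation 2
```

Keeping the message list on each edge, as the model prescribes, makes it straightforward to see which blocks are shared by several messages.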


Now,  let  us  consider  a  CB  characterized  by  the  following  parameters:  the  maximum  packet 
size  p  (bytes),  the  fixed  service  time  per  packet  a  and  the  incremental  service  time  per  byte 
m.  Then  the  total  service  time  t  for  a  message  of  size  x  is  given  by  the  following  equation: 


\[ t(x; a, m, p) = a \left\lceil \frac{x}{p} \right\rceil + m x, \tag{1} \]

where $\lceil x/p \rceil$ is the number of packets of maximum size $p$ being processed. For convenience we
rewrite (1) as:

\[ t(x; a, m, p) = a \cdot \xi(x, p) + \left( \frac{a}{p} + m \right) \cdot x, \tag{2} \]

where $\xi(x, p) = \lceil x/p \rceil - x/p = (p \lceil x/p \rceil - x)/p$ is a value between 0 and 1. Observe that for
$x \to 0$, $t(x; a, m, p) \to a$, and for $x \to \infty$ (i.e., $x \gg p$) the first term of (2) can be neglected,
i.e., $t(x; a, m, p) \approx (a/p + m) \cdot x$. Using these observations we approximate the total service
Figure 2: The total service time t(x; a, m, p) versus the continuous function T(x; a, b) used to
approximate it (a = 1, m = 0.5 and p = 2).


time $t$ with the following monotonically increasing continuous function defined on the interval
$[0, \infty)$ (Figure 2):

\[ T(x; a, b) = \sqrt{a^2 + b^2 x^2}, \tag{3} \]

where $b = a/p + m$. This is the equation of a hyperbola in the $(x, t)$ plane, with a horizontal
tangent and intercept $a$ at $x = 0$, and an asymptote of slope $b$; hence, the name of the model.
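
A small numerical comparison makes the two limits concrete. The sketch below evaluates the packetized time of equation (1) against a hyperbola with intercept a and asymptotic slope b. It assumes the hyperbolic form sqrt(a^2 + (b x)^2), which is the form implied by the stated intercept, horizontal tangent, and asymptote. The function names are hypothetical, and the parameter values are the ones quoted in the Figure 2 caption.

```python
import math


def packet_time(x, a, m, p):
    """Equation (1): packetized service time a*ceil(x/p) + m*x."""
    return a * math.ceil(x / p) + m * x


def hyperbolic_time(x, a, b):
    """Assumed hyperbolic form sqrt(a^2 + (b*x)^2)."""
    return math.hypot(a, b * x)


a, m, p = 1.0, 0.5, 2.0   # values quoted in the Figure 2 caption
b = a / p + m             # asymptotic cost per byte

for x in (0.01, 1.0, 2.0, 10.0, 1000.0):
    t = packet_time(x, a, m, p)
    T = hyperbolic_time(x, a, b)
    print(f"x={x:8.2f}  t={t:10.3f}  T={T:10.3f}")
```

For very small and very large x the two columns agree closely; in between, the continuous curve smooths over the sawtooth produced by the ceiling term.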

The improvement of (3) over a simple latency ($\alpha$) / reciprocal transfer rate ($\beta$) model,

\[ T(x; \alpha, \beta) = \alpha + \beta x, \tag{4} \]

is not so much in the fit of a continuous curve to the sawtooth form of a packetized transmission,
but in the analytical simplicity with which the parameters $(a, b)$ for a CG may be derived
in terms of its elemental CBs, as shown by the four combination rules in subsections 2.1
through 2.4. Using $T_i$ to estimate the total service time required by $CB_i$ (now characterized
by parameters $a_i$ and $b_i$ as $CB_i(a_i, b_i)$) to process a message of a given size, we derive rules for
reducing $n$ CBs interconnected in various structures to a single equivalent CB, with service
time $T(a_1, b_1, a_2, b_2, \ldots, a_n, b_n)$.

Although until now we have considered only a fixed and an incremental service time per packet,
the model can accommodate an additional fixed service time per message. This is useful in
cases where the first packet of the message has a higher fixed service time than all subsequent
packets. As an example, consider a network with wormhole routing; when the first packet
of the message is sent a route is chosen between source and destination, and all subsequent
packets of the same message are sent on the same route. Let us denote by $a^{(1)}$ the fixed service
time associated with the first packet and by $a^{(2)}$ the fixed service time associated with all
subsequent packets of the same message. The corresponding CB has the following parameters:

\[ a = a^{(1)}, \qquad b = \frac{a^{(2)}}{p} + m, \]

where $p$ is the packet size and $m$ is the incremental service time per data unit. In this case the
fixed service time associated with the message is $a^{(1)} - a^{(2)}$.

Figure 3: The equivalence transformation for two serial interconnected CBs: independent CBs
(right top), with $a_I = a_1 + a_2$, $b_I = \max(b_1, b_2)$, and dependent CBs (right bottom), with
$a_D = a_1 + a_2$, $b_D = b_1 + b_2$.


2.1  Serial  Interconnection 


Definition 1 Two communication blocks $CB_1(a_1, b_1)$ and $CB_2(a_2, b_2)$ are serially
interconnected with respect to a message $m$ if every packet of message $m$ is processed first by $CB_1$ and
next by $CB_2$, or first by $CB_2$ and next by $CB_1$.

Notice that this definition does not imply that a message is processed in its entirety by
one CB and only after that by the other CB. In fact, if the message is long (greater than the
maximum packet size) and the CBs are independent, as soon as $CB_1$ delivers a packet, $CB_2$
can start to process it. In other words (see Figure 3), while $CB_2$ processes the packet most
recently delivered by $CB_1$, $CB_1$ processes the next packet from its input message.

Next, we show how to transform this serial structure into an equivalent CB which has as
input the input of $CB_1$ and as output the output of $CB_2$. To determine the CB parameters
we consider two cases:

1.  Independent CBs. In this case, $CB_1$ and $CB_2$ run on different processors and therefore,
as we have pointed out, they can concurrently process a long message. It is easy to see that
when $x \to \infty$ the dominant term in the total service time is $\max(b_1 x, b_2 x)$, due to the fact
that either $CB_1$ waits for $CB_2$ to process the previous packet or $CB_2$ waits for $CB_1$ to deliver
a new packet. On the other hand, when $x \to 0$, the whole message fits in a single packet
and therefore $CB_2$ cannot begin processing until $CB_1$ finishes processing. Since the individual
service times are $a_1$ for $CB_1$ and $a_2$ for $CB_2$, it is clear that the total service time for CB is
$a_1 + a_2$. Hence, we obtain the following parameters for the equivalent CB:

\[ a_I = a_1 + a_2; \qquad b_I = \max(b_1, b_2). \]
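
The two limits for independent blocks can be checked with a simplified packet-level simulation. This is a sketch under stated assumptions, not the authors' procedure: both blocks are assumed to use the same maximum packet size p, each block is described by a hypothetical pair (fixed cost per packet, cost per byte), and the second block may start a packet only once the first block has delivered it and its own previous packet is done.

```python
import math


def pipeline_time(x, cb1, cb2, p):
    """Time for a message of x bytes to clear two independent CBs in series.

    Each cb is (fixed cost per packet, cost per byte); both blocks are
    assumed to share the same maximum packet size p.
    """
    n_packets = max(1, math.ceil(x / p))
    done1 = done2 = 0.0
    remaining = x
    for _ in range(n_packets):
        size = min(p, remaining)
        remaining -= size
        done1 += cb1[0] + cb1[1] * size                     # CB1 finishes this packet
        done2 = max(done1, done2) + cb2[0] + cb2[1] * size  # CB2 waits for it
    return done2


p = 1000.0
cb1 = (2.0, 0.004)        # hypothetical (fixed per packet, per byte)
cb2 = (1.0, 0.010)
b1 = cb1[0] / p + cb1[1]  # b = a/p + m for each block
b2 = cb2[0] / p + cb2[1]

small = pipeline_time(1e-6, cb1, cb2, p)
large_slope = pipeline_time(1e7, cb1, cb2, p) / 1e7
print(small, cb1[0] + cb2[0])      # small-message limit vs a1 + a2
print(large_slope, max(b1, b2))    # large-message slope vs max(b1, b2)
```

The slower block sets the pace for long messages, while a single-packet message pays both fixed costs in sequence.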

2.  Dependent CBs. Here, it is not possible for $CB_1$ and $CB_2$ to run concurrently (i.e., $CB_1$ and
$CB_2$ use a non-sharable common resource during the processing). This is no different from the
previous case for $x \to 0$ (the total service time is also $a_1 + a_2$), but, since no processing overlap
is possible, the total service time for long messages, i.e., when $x \to \infty$, becomes $b_1 x + b_2 x$. This
gives us the following CB parameters:

\[ a_D = a_1 + a_2; \qquad b_D = b_1 + b_2. \]

Now,  we  can  easily  generalize  our  results  by  giving  the  following  rule: 

Rule 1 (Serial Interconnection) Given $n$ serially interconnected communication blocks
$CB_i(a_i, b_i)$, $1 \le i \le n$, this structure is equivalent to a single communication block $CB(a, b)$,
where:

\[ a = \sum_{i=1}^{n} a_i; \qquad b = \max(b_1, b_2, \ldots, b_n) \tag{5} \]

if all CBs are independent, and

\[ a = \sum_{i=1}^{n} a_i; \qquad b = \sum_{i=1}^{n} b_i \tag{6} \]

if all CBs are dependent.
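
Rule 1 translates directly into a few lines of code. The following is a minimal sketch in which each block is a hypothetical `(a, b)` tuple and the function name `reduce_serial` is chosen here for illustration; the numerical values are made up.

```python
def reduce_serial(blocks, independent):
    """Rule 1: collapse serially interconnected CBs, each given as (a, b)."""
    a = sum(a_i for a_i, _ in blocks)
    if independent:
        b = max(b_i for _, b_i in blocks)   # equation (5)
    else:
        b = sum(b_i for _, b_i in blocks)   # equation (6)
    return a, b


blocks = [(3.0, 0.2), (1.0, 0.5), (2.0, 0.1)]    # hypothetical (a_i, b_i)
print(reduce_serial(blocks, independent=True))   # a = 6.0, b = 0.5
print(reduce_serial(blocks, independent=False))  # a = 6.0, b close to 0.8
```

The fixed costs add in both cases; only the treatment of the per-byte costs depends on whether the blocks can overlap their work.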

To  illustrate  the  use  of  rule  1,  consider  a  workstation  modeled  by  three  CBs  : 

•  $CB_a(a_a, b_a)$ - models the total service time at the application level (e.g., suppose the
application makes an extra copy to/from an internal buffer);

•  $CB_{os}(a_{os}, b_{os})$ - models the total service time due to the communication protocol
functions performed by the operating system;

•  $CB_c(a_c, b_c)$ - models the total service time due to communication protocol functions
performed by the network adapter. The parameter $a_c$ represents the time interval required to get
access to the communication network (this is influenced by the medium access control
mechanism [16]), while $b_c$ is the time required to send one data unit. The inverse
of $b_c$ corresponds to the available communication network bandwidth. The transmission
delay (the time interval required to send one data unit from source to destination on
the communication network) is ignored in this case as being much less than the other
communication parameters.

As in the previous example, assume that the general purpose processor runs the operating
system and the user processes, while the network adapter performs only specific communication
network functions. We can reduce this structure by applying rule 1 twice: first reduce the serially
interconnected dependent blocks $CB_a$ and $CB_{os}$ ($CB_a$ and $CB_{os}$ are dependent because they
run on the same processor) to $CB'$, and next reduce the serially interconnected independent
blocks $CB'$ and $CB_c$ to $CB(a, b)$. It is easy to verify that after these reductions we obtain the
following CB parameters: $a = a_a + a_{os} + a_c$, $b = \max(b_a + b_{os}, b_c)$.
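
Writing out the intermediate block makes the two-step reduction easy to verify. In the lines below, the primed parameters belong to the intermediate block, and the numerical values in the last line are hypothetical, chosen only to show which term determines b.

```latex
\begin{align*}
CB'(a', b') &:\quad a' = a_a + a_{os}, \qquad b' = b_a + b_{os}
  && \text{(dependent blocks, equation (6))} \\
CB(a, b) &:\quad a = a' + a_c, \qquad b = \max(b', b_c)
  && \text{(independent blocks, equation (5))} \\
\text{e.g. } b_a = 0.1,\ b_{os} = 0.3,\ b_c = 0.8 &:\quad b = \max(0.4,\ 0.8) = 0.8
\end{align*}
```

With these illustrative numbers the network adapter, not the host processor, limits throughput for long messages.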
Figure 4: The equivalence transformation for two parallel interconnected dependent CBs
(bottom left) and independent CBs (bottom right).

2.2  Parallel  Interconnection 

Definition 2 Two communication blocks $CB_1$ and $CB_2$ are parallel interconnected with respect
to a message $m$ if every packet of that message can be processed either by $CB_1$ or $CB_2$.


Figure 4 shows two parallel interconnected communication blocks. For this type of
interconnection we assume that the packets are processed in the way that minimizes the total service
time of the message. As before, we take into account two cases:

1.  Independent CBs. Let us denote by $x$ the total size of the message $m$. According to
our assumption, if $x \to 0$ it is clear that the total service time is minimum when the input is
entirely processed by $CB_1$ if $a_1 < a_2$, or by $CB_2$ otherwise. When $x \to \infty$, the total service time
is minimized when the splitting of $x$ ensures an equal load for both $CB_1$ and $CB_2$. Denoting
by $x_1$ and $x_2$ the sizes of the inputs processed by $CB_1$ and by $CB_2$ respectively, it is easy
to see that load balancing is achieved when $b_1 x_1 = b_2 x_2$ with $x_1 + x_2 = x$, that is,
$x_1 = x b_2/(b_1 + b_2)$, $x_2 = x b_1/(b_1 + b_2)$. Finally,
combining either of these solutions with the asymptotic expression of the total service time of
CB, $T(x; a, b) \approx bx$ for $x \to \infty$, we obtain the overall CB parameter set:

\[ a_I = \min(a_1, a_2); \qquad b_I = \frac{b_1 b_2}{b_1 + b_2}. \tag{7} \]
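
The load-balancing argument can be checked with a short calculation. The sketch below splits a message between two independent parallel blocks so that both finish at the same time; the function name `balanced_split` and the numerical values are hypothetical.

```python
def balanced_split(x, b1, b2):
    """Split x bytes so two independent parallel CBs finish together."""
    x1 = x * b2 / (b1 + b2)   # share handled by CB1
    x2 = x * b1 / (b1 + b2)   # share handled by CB2
    return x1, x2


x, b1, b2 = 9000.0, 1.0, 2.0    # hypothetical size and per-byte costs
x1, x2 = balanced_split(x, b1, b2)
print(x1, x2)                   # 6000.0 3000.0
print(b1 * x1, b2 * x2)         # both blocks need the same time
print(x * b1 * b2 / (b1 + b2))  # time predicted by the combined slope
```

The faster block receives the larger share, and the common finishing time equals the combined per-byte cost multiplied by the message size.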


2.  Dependent CBs. Since both CBs run on the same processor it is obvious that we can
minimize the service time by simply choosing the best parameters in each case (e.g., for $x \to 0$
we choose the CB that has the minimum fixed service time, while for $x \to \infty$ we choose the
CB that has the minimum incremental service time), which gives us the following results:

\[ a_D = \min(a_1, a_2); \qquad b_D = \min(b_1, b_2). \tag{8} \]

Figure 5: The total service time T(x; 2, 2) for the resulting CB after the equivalence
transformation of two dependent parallel interconnected CBs with total service time given by
T(x; 2, 3) and T(x; 3, 2), respectively.

More  generally,  it  can  be  shown  that: 

Rule 2 (Parallel Interconnection) Given $n$ parallel interconnected communication blocks
$CB_i(a_i, b_i)$, $1 \le i \le n$, this structure is equivalent to a single communication block $CB(a, b)$
where:

\[ a = \min(a_1, a_2, \ldots, a_n); \qquad \frac{1}{b} = \sum_{i=1}^{n} \frac{1}{b_i} \tag{9} \]

if all CBs are independent, and

\[ a = \min(a_1, a_2, \ldots, a_n); \qquad b = \min(b_1, b_2, \ldots, b_n) \tag{10} \]

if all CBs are dependent.
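
Rule 2 can be sketched in the same style as Rule 1. The independent case below uses the reciprocal sum of the per-byte costs, which is an assumption consistent with the two-block result (7) and with the resistor analogy used in the text. The function name `reduce_parallel` is hypothetical; the two blocks are the ones quoted for Figure 5.

```python
def reduce_parallel(blocks, independent):
    """Rule 2: collapse parallel interconnected CBs, each given as (a, b)."""
    a = min(a_i for a_i, _ in blocks)
    if independent:
        b = 1.0 / sum(1.0 / b_i for _, b_i in blocks)   # equation (9)
    else:
        b = min(b_i for _, b_i in blocks)               # equation (10)
    return a, b


blocks = [(2.0, 3.0), (3.0, 2.0)]   # the two CBs quoted for Figure 5
print(reduce_parallel(blocks, independent=False))  # (2.0, 2.0)
print(reduce_parallel(blocks, independent=True))   # a = 2.0, b close to 1.2
```

Independent blocks share the load and therefore do better than the best single block; dependent blocks can at most match the best block in each limit.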

Figure 5 shows the total service time functions of two dependent parallel interconnected
CBs and of the resulting CB, after the equivalence transformation. The initial CBs have the
communication parameters $a = 2$, $b = 3$ and $a = 3$, $b = 2$ respectively. According to the
above rule, in this case (dependent interconnection), the equivalent CB has as parameters
$a = \min(2, 3) = 2$, $b = \min(3, 2) = 2$. At the limits, the total service time function of the equivalent
CB, $T(x; 2, 2)$, approaches asymptotically the better of the service time functions of the initial
CBs, i.e., for $x \to 0$ it approaches $T(x; 2, 3)$, while for $x \to \infty$ it approaches $T(x; 3, 2)$.
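
The limiting behavior described for Figure 5 can be tabulated numerically. This is an illustrative sketch that assumes the hyperbolic form sqrt(a^2 + (b x)^2) for the service time; the function name `T` mirrors the notation of the text and the sample points are arbitrary.

```python
import math


def T(x, a, b):
    """Assumed hyperbolic service time sqrt(a^2 + (b*x)^2)."""
    return math.hypot(a, b * x)


for x in (0.001, 0.1, 1.0, 10.0, 1000.0):
    print(f"x={x:8.3f}  T(2,2)={T(x, 2, 2):10.4f}  "
          f"T(2,3)={T(x, 2, 3):10.4f}  T(3,2)={T(x, 3, 2):10.4f}")
```

At every sample point the equivalent block is no slower than either original block, matching the first column to the second for small x and to the third for large x.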
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Before  turning  to  the  more  difficult  case  of  concurrent  message  processing,  we  summarize 
the  results  of  serial  and  parallel  interconnection  on  independent  and  dependent  CBs.  In  the 
small  message  limit  that  governs  the  o  parameter,  CBs  in  serial  combine  additively  and  CBs 
in  parallel  combine  by  taking  the  minimum.  In  the  large  message  limit  that  governs  the 
h  parameter,  CBs  in  serial  that  are  dependent  combine  like  resistors  in  series,  and  CBs  in 
parallel  that  are  independent  combine  like  resistors  in  parallel.  The  other  two  subcases  obey  a 
maximum  (serial,  independent)  or  a  minimum  (parallel,  dependent)  law  in  deriving  the  overall 
6.  No  Approximations  are  necessary  in  deriving  these  rules. 
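
The four combination laws summarized above can be written down compactly. The following is a minimal Python sketch, not code from the original work: the function names `serial` and `parallel` and the representation of a CB as an `(a, b)` tuple are illustrative assumptions, and the parallel-independent law assumes every $b_i > 0$.

```python
from typing import Iterable, Tuple

CB = Tuple[float, float]  # (a, b): fixed and incremental service time


def serial(blocks: Iterable[CB], dependent: bool) -> CB:
    """Reduce serially interconnected CBs to one equivalent CB (rule 1)."""
    blocks = list(blocks)
    a = sum(a_i for a_i, _ in blocks)        # small-message limit: additive
    if dependent:
        b = sum(b_i for _, b_i in blocks)    # like resistors in series
    else:
        b = max(b_i for _, b_i in blocks)    # slowest stage dominates
    return a, b


def parallel(blocks: Iterable[CB], dependent: bool) -> CB:
    """Reduce parallel interconnected CBs to one equivalent CB (rule 2)."""
    blocks = list(blocks)
    a = min(a_i for a_i, _ in blocks)        # small-message limit: minimum
    if dependent:
        b = min(b_i for _, b_i in blocks)    # equation (10)
    else:
        # like resistors in parallel, equation (9); assumes every b_i > 0
        b = 1.0 / sum(1.0 / b_i for _, b_i in blocks)
    return a, b
```

For the two dependent parallel CBs of Figure 5, `parallel([(2, 3), (3, 2)], dependent=True)` gives the equivalent block `(2, 2)`.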

2.3  Concurrent  Message  Processing 

Until now we have considered the processing of individual messages of a given size. In this section
we analyze the general case in which a CB receives $n$ messages $m_1, m_2, \ldots, m_n$ of sizes
$x_1, x_2, \ldots, x_n$ to be processed (Figure 6). We assume that the CB processes the messages
$m_1, \ldots, m_n$ using an arbitrary policy, i.e., it first processes $m_{i_1}$, next $m_{i_2}$, and last $m_{i_n}$
(where $i_1, \ldots, i_n$ is a permutation of $1, \ldots, n$). Therefore, we cannot tell exactly how long it
takes for the CB to process a message $m_i$ in the presence of other messages, but we know the
corresponding total service time for each $m_i$ if it is processed alone (3). Now let
$x_i \to 0$ $(i = 1, \ldots, n)$. In this case every message takes the same amount of time $a$ to be
processed and, therefore, the total service time for all messages is $na$. Next, take $x_i \to \infty$.
To compute the total service time we assume for simplicity that messages are processed
sequentially without delays, so the total service time is given by the following equation:

$$t(x_1, x_2, \ldots, x_n; a, b) = b \sum_{i=1}^{n} x_i. \qquad (11)$$


[Figure 6 (diagram): $n$ messages of sizes $x_1, x_2, \ldots, x_n$ entering $CB(a, b)$; for message $m_i$ this is equivalent to a block $CB_i(a_i, b_i)$ with $a_i = na$ and $b_i = b(x_1 + x_2 + \cdots + x_n)/x_i$.]

Figure  6:  The  equivalence  transformation  in  the  concurrent  message  processing  case. 

Since we cannot tell exactly when a particular message $m_i$ is processed, we take the time
required to process $m_i$ to be bounded by the time required to process all messages, which is
equivalent to the case in which $m_i$ is the last message processed. According to the previous
limit conditions we can write the total service time as:

$$T(x_i \mid X, n; a, b) = \frac{(na)^2}{na + bX} + bX, \qquad (12)$$

where $X = \sum_{i=1}^{n} x_i$ is the total amount of information processed by the CB. The "$x_i \mid X, n$"
notation indicates that the message of size $x_i$ is processed concurrently with $n - 1$ other
messages of total size $X - x_i$. To be consistent, when $X = x_i$ (which also implies $n = 1$) we
remove "$\mid X, n$" from the notation. Then, with the notation $a' = na$ and $b' = bX/x_i$, we write
(12) as:

$$T(x_i; a', b') = \frac{a'^2}{a' + b' x_i} + b' x_i. \qquad (13)$$

By associating (13) with (3), we can state the third rule:


Rule 3 (Concurrent Processing) A communication block $CB(a, b)$ that processes $n$ messages
$m_1, m_2, \ldots, m_n$ of sizes $x_1, x_2, \ldots, x_n$, respectively, is equivalent to a structure of
$n$ communication blocks $CB_1(a_1, b_1)$, $CB_2(a_2, b_2)$, ..., $CB_n(a_n, b_n)$, where $CB_i$ independently
processes the message $m_i$ and has parameters:

$$a_i = na; \qquad b_i = b\,\frac{\sum_{j=1}^{n} x_j}{x_i}. \qquad (14)$$


For the particular case in which all messages have the same length we obtain $b' = nb$ (both
parameters $a$ and $b$ are scaled by the same value $n$). For a random order of messages, $a_i$ is
pessimistic by only a factor of two on average.
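
The factor of two can be made plausible with a short calculation. This is a sketch under an assumption not stated explicitly in the text, namely that all $n!$ processing orders are equally likely and that the messages are small, so that each one takes time $a$. If $m_i$ happens to be processed $k$-th, it completes at time $ka$, with $k$ uniform on $\{1, \ldots, n\}$:

```latex
\mathbb{E}\bigl[\text{completion time of } m_i\bigr]
  = \frac{a}{n}\sum_{k=1}^{n} k = \frac{(n+1)\,a}{2},
\qquad
\frac{a_i}{\mathbb{E}\bigl[\text{completion time of } m_i\bigr]}
  = \frac{na}{(n+1)\,a/2} = \frac{2n}{n+1} < 2 .
```

Under this assumption the bound $a_i = na$ overestimates the average fixed cost by a factor that approaches, but never reaches, two as $n$ grows.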

As an illustration of applying this rule (and of the first rule) we take a simple example. As
depicted in Figure 7, suppose we have five processes (numbered from 1 to 5) running on different
machines. Each machine is represented by a CB that includes all the communication protocol
functions (implemented by the application, the operating system, and the network adapter).
Further, assume that processes 1 and 3 each send a message to process 4, while process 2
sends one message to process 5. The question is to determine the total service time to send the
message $m_1$ from 1 to 4. To answer this question we reduce the initial CG (Figure 7(a)) in two
steps. First, applying rule 3 to $CB_C$ and $CB_4$ we obtain an intermediate structure consisting
of three serially interconnected independent communication blocks $CB_1(a_1, b_1)$, $CB'_C(a'_C, b'_C)$ and
$CB'_4(a'_4, b'_4)$ (Figure 7(b)), such that $a'_C = 3a_C$, $b'_C = b_C(x_1 + x_2 + x_3)/x_1$, $a'_4 = 2a_4$,
$b'_4 = b_4(x_1 + x_3)/x_1$. Next, using rule 1, this structure is reduced to the final structure consisting
of a single communication block CB (Figure 7(c)) with parameters $a = a_1 + a'_C + a'_4$ and
$b = \max(b_1, b'_C, b'_4)$, which are finally used to compute the total time to deliver $m_1$ from process
1 to process 4 by substitution into equation (3).
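
The two reduction steps of this example can be traced numerically. The sketch below is illustrative only: all parameter values and message sizes are hypothetical, the helper names are not from the original work, and the hyperbolic service time is assumed to have the form $T(x; a, b) = a^2/(a + bx) + bx$, which tends to $a$ for small $x$ and to $bx$ for large $x$.

```python
def service_time(x, a, b):
    """Assumed hyperbolic form: tends to a as x -> 0 and to b*x as x -> infinity."""
    return a * a / (a + b * x) + b * x


def concurrent(a, b, sizes, i):
    """Rule 3: parameters seen by message i when CB(a, b) handles all `sizes`."""
    return len(sizes) * a, b * sum(sizes) / sizes[i]


# Hypothetical CB parameters (arbitrary units) and message sizes.
a1, b1 = 1.0, 0.5          # CB_1, the sender's workstation
aC, bC = 0.2, 1.0          # CB_C, the shared network
a4, b4 = 1.0, 0.5          # CB_4, the receiver's workstation
x1, x2, x3 = 100.0, 50.0, 200.0

# Step 1 (rule 3): the network carries m1, m2, m3; process 4 receives m1 and m3.
aC_p, bC_p = concurrent(aC, bC, [x1, x2, x3], 0)
a4_p, b4_p = concurrent(a4, b4, [x1, x3], 0)

# Step 2 (rule 1, independent blocks): fixed times add, the largest b wins.
a = a1 + aC_p + a4_p
b = max(b1, bC_p, b4_p)

print("a =", a, "b =", b, "T(x1) =", service_time(x1, a, b))
```

With these made-up numbers the shared network is the bottleneck: its scaled incremental time $b'_C$ exceeds both workstation terms, so it determines $b$.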


2.4  The  General  Reduction  Rule 

Thus far, we have implicitly assumed that the communication graph is the same for small and
large messages. Although this is true for many cases, for complex communication patterns this
assumption is no longer valid. As an example, we will show that for the broadcast implementation
(section 3.1) based on the binary tree topology, for small messages we can ignore the
contention on a shared communication network if the transmission time is orders of magnitude
less than the sending and receiving overhead, while for large messages the contention cannot
be ignored. In consequence, the CG will be different for small and large messages. In this case
the following general reduction rule may be used:

Rule 4 (General Reduction) Given two terminal nodes $s$ and $d$ such that $s$ sends a message
$m$ of size $x$ to $d$, the total service time for the message $m$ is:

$$T(x; a, b) = \frac{a^2}{a + bx} + bx, \qquad (15)$$

where $a$ is the service time when sending a small message from $s$ to $d$ ($x \to 0$), and $b$ is the
service time per data unit when sending a large message from $s$ to $d$ ($x \to \infty$).

[Figure 7 (diagram): (a) the initial communication graph carrying messages $m_1(x_1, 4)$, $m_2(x_2, 5)$ and $m_3(x_3, 4)$; (b) the intermediate structure; (c) the final single CB with $a = a_1 + 3a_C + 2a_4$ and $b = \max(b_1,\ b_C(x_1 + x_2 + x_3)/x_1,\ b_4(x_1 + x_3)/x_1)$.]

Figure 7: Computing the estimated time for the message $m_1$ to be delivered to process 4 by
successive reductions of the communication graph.

The parameters $a$ and $b$ can be computed by using rules 1-3 to reduce the corresponding
CGs. Notice that the general reduction rule is equivalent to reducing the paths along which
message $m$ may travel between source and destination (called $m$-communication paths) to a
single communication block $CB_0$ with parameters $a$ and $b$.

2.5  Communication  Time  Measures 

When a message is sent between two end processes, represented as terminal nodes in the CG, three
measures are particularly important:

•  the total time interval between sending the message (by the source process) and delivering
it (to the destination process), called total communication time ($T_c$). As we have shown
in the previous section, by applying the general reduction rule, $T_c$ can be computed as
the total service time of the resulting $CB_0$.

•  the time spent by the sender while sending the message, called sending time ($T_s$).

•  the time spent by the receiver while receiving the message, called receiving time ($T_r$).

To determine $T_s$ and $T_r$, we need to take a closer look at the sending and receiving
mechanisms. First, consider all paths between source and destination along which a
message travels. Next, using the equivalence transformation rules, we reduce all paths to a
single path containing only independent CBs: $CB_1, CB_2, \ldots, CB_n$ (we assume that this
is always possible), where the source process runs on the same processor as $CB_1$ and the
destination process runs on the same processor as $CB_n$. Now, let us analyze the mechanism of
sending a message from source to destination along the equivalent path. The discussion here is
similar to that of the serial interconnection of independent CBs. If the message is large, i.e., it
consists of a large number of packets, then the message is processed concurrently by the
independent CBs in a pipeline fashion. As we have shown, in this case the processing speed is
determined by the slowest CB. It follows that if $CB_1$ is not the slowest communication block,
then after it processes a packet it must wait a certain amount of time before delivering the
next packet to $CB_2$.

In our model we define $T_s$ either as the time required by $CB_1$ to process all message packets,
or as the time interval between the start of processing of the first packet and the delivery of the
last packet to $CB_2$. In the first case the send primitive is said to be preemptive, while in the
latter it is said to be non-preemptive. When a preemptive send primitive is used, control is
returned to the caller process as soon as the send operation is initiated, and further
computation can be performed concurrently with the processing required to send the message.
When a non-preemptive send primitive is used, the caller process is blocked from the moment
of calling the send primitive until the last packet of the message is delivered to $CB_2$. Since
our main focus is to determine the real processing time spent by a CB in sending or receiving a
particular message, we prefer to use the terms preemptive and non-preemptive to characterize
the communication primitives, rather than redefining overloaded terms such as
blocking/non-blocking and synchronous/asynchronous, which are usually used to define the
semantics of the communication primitives. Differences between various types of communication
primitives (see [5] for an extensive discussion and formal treatment) are ultimately captured in
the CB parameters. What is important from the point of view of performance evaluation is the
extent to which concurrent processing by the application process and its neighboring CB is allowed.

As an example of a preemptive send primitive, let us consider a single-processor workstation
that runs a preemptive operating system (e.g., UNIX). We roughly describe how the
send primitive may be implemented. When the application process (which runs on the same
processor as the operating system) invokes the send primitive, the first packet of the message
is processed and delivered to $CB_2$. Next, control is returned to the caller process, which
can proceed with its computations. After $CB_2$ processes and delivers the current packet, it
asks $CB_1$ for the next packet to be sent (usually, this is done using an interrupt mechanism).
In turn, the application process is interrupted and the next packet is processed and delivered
to $CB_2$. This procedure continues until the last packet of the message is sent out. If we neglect
the overhead of interrupts and operating system calls, then it is clear that $T_s$ is the total time
required by $CB_1$ to process all the packets of the message.

In the case of the non-preemptive send primitive implementation, after the first packet
of the message is processed and delivered, $CB_1$ waits to deliver the next one. Therefore the
sender process is blocked until the last packet of the message is delivered to $CB_2$.

To determine $T_s$ we consider several cases (see Figure 8):

•  if the message is small, i.e., it fits in one packet, we take $T_s$ equal to the $CB_1$ service time
$T_{CB_1}$ for both preemptive and non-preemptive send primitives. This is equivalent to
considering that when the send primitive is invoked, the message is processed and delivered
in only one packet to $CB_2$ and then control is returned to the application process.

•  if the message is large and a non-preemptive send primitive is used, then it is easy to see
that the total communication time $T_c$ is equal to $T_s$ plus the time required for the last
message packet to be delivered to the destination process, i.e., $T_{CB_n}$. Therefore we can
take as an upper bound for $T_s$ the total communication time $T_c$.

•  if the message is large and a preemptive send primitive is used, then $T_s$ accounts for the
total time required by $CB_1$ to process and deliver all the packets of the message, and thus
we take $T_s$ equal to $T_{CB_1}$.

Although we have considered very simple send primitive implementations, the model can
accommodate more complicated implementations. As an example, let us assume that the
communication protocol requires the receiver to be informed about the size of the message
before the message is actually sent (in order for the receiver to allocate memory space for the
incoming message). Moreover, consider that this implementation is based on exchanging two
messages: one to inform the receiver about the size of the message and one to acknowledge
that the buffer has been allocated and the sender can proceed. This case can be modeled by
adding a new independent communication block before $CB_1$, called $CB_0$, with the following
parameters: $a$, equal to the average time required to exchange the two messages plus the
overhead of allocating the memory at the receiver and possibly other interrupt and system call
overheads, and $b = 0$.

As another example, let us assume that for a non-preemptive send primitive implementation
the communication protocol requires that every packet be acknowledged by the receiver.
In this case we can add a new communication block $CB'$ after $CB_0$ which has as parameters:
$a$, equal to the average time for the sender to receive the acknowledgment from the receiver,
and $b = 0$.

Sending time $T_s$:

                                      preemptive                non-preemptive
short message ($x \to 0$)             $T_{CB_1}(x; a_1, b_1)$   $T_{CB_1}(x; a_1, b_1)$
long message ($x \to \infty$)         $T_{CB_1}(x; a_1, b_1)$   $T_c(x)$

Receiving time $T_r$:

                                      preemptive                non-preemptive
short message ($x \to 0$)             $T_{CB_n}(x; a_n, b_n)$   $T_{CB_n}(x; a_n, b_n)$
long message ($x \to \infty$)         $T_{CB_n}(x; a_n, b_n)$   $T_{CB_n}(x; a_n, b_n) \le T_r \le T_c(x)$

Figure 8: Sending ($T_s$) and receiving ($T_r$) time expressions, where the message size is $x$.

Now, let us concentrate on the receiving time $T_r$. Since we are not interested here in the
synchronization time, we assume that the receive primitive is called at the same time the
first packet of the message is received by $CB_n$. Similarly to $T_s$, $T_r$ is defined either as the
time required by $CB_n$ to process all packets of a message (preemptive receive primitive), or
as the time interval between the beginning of processing of the first packet and the end of
processing of the last packet of the message (non-preemptive receive primitive). The $T_r$
analysis is the same as the $T_s$ analysis for small messages, and for large messages when a
preemptive receive primitive is used (see Figure 8). The major difference arises for large
messages and non-preemptive receive primitives. Unlike the preemptive send primitive, where
after the first packet is sent the application process can proceed, when receiving we assume
that the application cannot proceed until the last packet of the message is received (in other
words, the message is not passed to the application process until it is completely received), and
thus we take $T_r$ equal to $T_c$. On the other hand, if more than one message is received at the
same time, the waiting time between processing packets from the same message can be used to
process packets from other messages and therefore, in the limit, we can take $T_r$ equal to $T_{CB_n}$.

Although in Figure 8 only the expressions of $T_s$ and $T_r$ for the extreme message sizes
($x \to 0$, $x \to \infty$) are given, we can again use equation (3) to approximate $T_s(x; a_s, b_s)$
and $T_r(x; a_r, b_r)$ for any message size $x$, where $a_s = T_s(x \to 0)$, $a_r = T_r(x \to 0)$,
$b_s = \lim_{x \to \infty} T_s(x)/x$ and $b_r = \lim_{x \to \infty} T_r(x)/x$.
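
The sending-time cases of Figure 8 can be turned into a small helper. This is a sketch under stated assumptions rather than the authors' procedure: the function names are hypothetical, the hyperbolic form $T(x; a, b) = a^2/(a + bx) + bx$ is assumed for equation (3), and for the non-preemptive case the upper bound $T_c$ is used as the large-message limit.

```python
def service_time(x, a, b):
    """Assumed hyperbolic form: tends to a as x -> 0 and to b*x as x -> infinity."""
    return a * a / (a + b * x) + b * x


def sending_params(cb1, path, preemptive):
    """Return (a_s, b_s) for the sending time T_s.

    cb1  -- (a_1, b_1) of the sender-side block CB_1
    path -- (a, b) of the whole source-to-destination path after reduction
    """
    a1, b1 = cb1
    _, b_path = path
    # Small messages: T_s -> T_CB1 for both kinds of send primitive.
    a_s = a1
    # Large messages: T_CB1 if preemptive, otherwise the upper bound T_c.
    b_s = b1 if preemptive else b_path
    return a_s, b_s


def sending_time(x, cb1, path, preemptive):
    """Approximate T_s(x; a_s, b_s) for any message size x."""
    a_s, b_s = sending_params(cb1, path, preemptive)
    return service_time(x, a_s, b_s)
```

The receiving time can be treated the same way with $CB_n$ in place of $CB_1$, keeping in mind that for large messages and a non-preemptive receive the text gives only a range, $T_{CB_n} \le T_r \le T_c$.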

3  Common  Communication  Patterns  in  Parallel  Applications 

In this section we give some examples of how the model can be used to analyze four archetypal
communication patterns encountered in parallel applications: broadcast, synchronization,
global reduction, and nearest neighbor communication.

We consider a network of homogeneous workstations interconnected by a communication
network. Each workstation is represented by a communication block $CB_W$, while the
communication network is represented by a communication block $CB_C$. Also, when a message
is received, a communication block $CB_L$ is added between the communication network $CB_C$
and the receiver $CB_W$. The role of $CB_W$ is to capture the message processing overheads at
send and receive (here we assume that the send and receive processing overheads are equal).
$CB_C$ captures the communication network bandwidth ($1/b_C$) and the possible delay before the
first bit of the packet is sent on the network ($a_C$). Finally, $CB_L$ captures the communication
delay $L$ ($a_L = L$, $b_L = 0$). The send and receive primitives are considered non-preemptive.
Figure 9 shows the communication graph for a message transmission between two processors.

[Figure 9 (diagram): workstations $WS_1$ and $WS_2$.]

Figure 9: The communication graph for sending one message from process 1 to process 2.
Observe that $CB_L$ appears in the communication path only between the communication network
($CB_C$) and the workstation communication block $CB_W$.


3.1  Broadcast 

The broadcast primitive ensures the delivery of a message from one processor to $N$ other
processors. We consider two broadcast implementations. In the first, a binary tree is used to
broadcast the message from a root processor to all other processors, as indicated in Figure 10.
For simplicity, we assume that every node of index $i$ sends the message first to its left child
($2i + 1$) and next to its right child ($2i + 2$). We are interested in determining the total time
required to complete the broadcast, i.e., from the moment when the root begins the transmission
of the first message to the moment when the message is received by the last processor. As usual,
two extreme cases are considered: the message is very small ($x \to 0$), and the message is very
large ($x \to \infty$). For small messages we assume that the sending and receiving overheads $a_W$ are
much larger than the actual transmission time $a_C + b_C x$, and therefore we do not address the
situations in which more than one processor sends a message on the communication network
at the same time. With this assumption it is easy to see that the communication time
between any two processors is $T_c = 2a_W + a_C + a_L$, while the sending and receiving times
are $T_s = T_r = a_W$. Returning to our example (Figure 10(b)), the time required to complete the
broadcast is $8a_W + 3a_C + 3a_L$. Generally, for a complete binary tree of height $h$, the time to
complete the broadcast is $h(3a_W + a_C + a_L)$, since on the longest path each level adds two
sends by the parent, one network traversal, and one receive.
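
The small-message completion time can be checked with a short simulation. The sketch below is an illustration under the assumptions stated in the text (each send or receive occupies a workstation for $a_W$, network contention is ignored, left child before right child); the function name and the numeric values in the usage note are hypothetical.

```python
def tree_broadcast_small(n_nodes, aW, aC, aL):
    """Completion time of a binary-tree broadcast of a very small message.

    Node i forwards first to child 2i+1, then to child 2i+2. A send or a
    receive keeps a workstation busy for aW; network contention is ignored.
    """
    done = [0.0] * n_nodes              # time at which node i holds the message
    for i in range(n_nodes):            # parents always precede their children
        t = done[i]
        for child in (2 * i + 1, 2 * i + 2):
            if child < n_nodes:
                t += aW                             # sender busy for this send
                done[child] = t + aC + aL + aW      # network, delay, receive
    return max(done)
```

For example, with $a_W = 1$, $a_C = 0.5$ and $a_L = 0.25$, the 11-processor tree of Figure 10 completes at $8a_W + 3(a_C + a_L)$, while a complete tree of height 3 (15 processors) completes at $3(3a_W + a_C + a_L)$.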

In the case of large messages the transmission time and other incremental service times are
much larger than the communication delay and the corresponding fixed service times. The activity
of each processor over time is depicted in Figure 10(c). Since we consider non-preemptive send
and receive primitives, we have $T_s = T_r = T_c$ (see the tables in Figure 8). As one can observe,
there are moments in time when more than one processor sends a message on the communication
network (e.g., the transmission between processors 0 and 2 takes place simultaneously with the
transmission between processors 1 and 3). If there are $n$ processors that concurrently send
messages of the same size, then by applying rule 3 and the general reduction rule we obtain
for every message path (in our topology a message travels along only one path between
source and destination) the equivalent communication block $CB_0$ with parameters
$a_0 = 2a_W + n a_C + a_L$ and $b_0 = \max(n b_C, b_W)$. Since, for very large messages, the incremental
service time dominates the fixed service time, we approximate $T_c$ by $b_0 x$, where $x$ is the size
of the message being broadcast. Consequently, the time required to complete the broadcast
in our example is equal to $x\,(2\max(b_C, b_W) + \max(2b_C, b_W) + 2\max(3b_C, b_W))$. Now, the
entire broadcast communication graph can be reduced to one communication block $CB_{BCAST}$
with the following parameters: $a_{BCAST} = 8a_W + 3a_C + 3a_L$,
$b_{BCAST} = 2\max(b_C, b_W) + \max(2b_C, b_W) + 2\max(3b_C, b_W)$.
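
The large-message expression can be reproduced by counting how many transmissions overlap in each step of the tree broadcast. The following sketch assumes, as a simplification, that transmissions proceed in synchronized slots, one full message per sender per slot, and that a slot with $k$ concurrent senders costs $x \max(k b_C, b_W)$; the function name is hypothetical.

```python
from collections import Counter


def tree_broadcast_large(n_nodes, x, bW, bC):
    """Approximate binary-tree broadcast time for a very large message of size x."""
    slot = [0] * n_nodes                 # slot in which node i receives the message
    for i in range(n_nodes):             # parents always precede their children
        # The left child is served in the next slot, the right child one slot later.
        for order, child in enumerate((2 * i + 1, 2 * i + 2), start=1):
            if child < n_nodes:
                slot[child] = slot[i] + order
    senders = Counter(slot[1:])          # number of concurrent transmissions per slot
    return sum(x * max(k * bC, bW) for k in senders.values())
```

For the 11-processor tree the slots carry 1, 2, 3, 3 and 1 concurrent transmissions, which gives the coefficients 2, 1 and 2 in front of $\max(b_C, b_W)$, $\max(2b_C, b_W)$ and $\max(3b_C, b_W)$.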




Figure  10:  a)  The  broadcast  binary  tree  for  11  processors,  b)  The  processor  activity  when  a 
small  message  is  broadcast.  The  sending  time  is  represented  by  empty  bars  while  the  receiv¬ 
ing  time  is  represented  by  shadowed  bars,  c)  The  processor  activity  when  large  messages  are 
broadcast. 

Figure 11: The estimated $T_c$ versus experimental data for the broadcast binary tree implementation.
The regression coefficient is 0.88. The experiments were run on 11 Sun SPARCstation ELC
workstations interconnected by an Ethernet network using p4.

In the second broadcast implementation (which is the native implementation in p4, version
1.3) the root processor simply sends the message to every other processor: $1, 2, \ldots, N$. It
is easy to see that in this case the time to complete the broadcast is
$(N + 1)a_W + a_C + a_L$ for small messages and $N \max(b_C, b_W)\,x$ for large messages. Although
this implementation is the simplest possible, notice that if $b_C > b_W$, no other broadcast
implementation gives better performance for large messages (this can be easily verified for the
binary-tree broadcast implementation). This is because in this case the communication network
is the bottleneck for any number $i$ of messages that are sent concurrently
($\max(i b_C, b_W) = i b_C$), and therefore the total broadcast time has as a lower bound the time
required to send all messages across the communication network, which is $N b_C x$.
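
As a quick check of this claim on the 11-processor tree (so $N = 10$), here is a sketch assuming $b_C \ge b_W$, so that every maximum in the tree-broadcast expression is attained by the network term:

```latex
x\,\bigl(2\max(b_C, b_W) + \max(2b_C, b_W) + 2\max(3b_C, b_W)\bigr)
  = x\,\bigl(2\,b_C + 2\,b_C + 6\,b_C\bigr)
  = 10\,b_C\,x = N\,b_C\,x .
```

Under this assumption the binary-tree broadcast takes exactly as long as the serial broadcast for large messages, so the tree offers no gain once the network is the bottleneck.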

The shapes of the estimated communication time functions for the tree-based and serial
broadcasts, together with experimental measurements of $T_c$, are shown in Figures 11 and 12,
respectively. All experiments in this and subsequent sections were run during periods of dedicated
time on up to 16 Sun SPARCstation ELC workstations. The p4 package from Argonne National
Laboratory [2] served as the application-level communication support. The model approximates
the actual measurements closely: the regression coefficients are 0.88 for
the binary tree broadcast implementation and 0.92 for the second broadcast implementation.
In both cases, the maximum error was around 10%. All CB parameters were experimentally
determined in the limits of small or large message size and one or many processors, using the
procedure described in section 4.

3.2  Synchronization  and  Global  Operation 

Both synchronization and global operation primitives can be implemented using the same
communication pattern. Although it is not the most efficient implementation, we describe here
the one used in p4, version 1.3. The global operation implementation differs from the
synchronization in two respects. First, during the global operation the synchronization messages
carry partial results and second, besides sending and receiving messages the processors are
responsible for computing partial and final results. Therefore, the synchronization can be seen
as a special case of global operation where no computation is performed. In the remainder of
this subsection, we concentrate on the global operation implementation.

Figure 12: The estimated $T_c$ versus experimental data for the native broadcast implementation.
The regression coefficient is 0.92.

A global operation primitive implements a group computation. Formally, a group computation
is defined as follows: given $n$ different items $a_1, a_2, \ldots, a_n$ in a group $(S, \oplus)$ (where $\oplus$
is a binary associative and commutative operation defined on the set $S$), compute the final value
$a_1 \oplus a_2 \oplus \cdots \oplus a_n$. The following are examples of group computation: finding the sum, the
maximum, or the minimum of a set of $n$ numbers.

The global reduction primitive gathers a value (or a set of values) from each processor,
computes from them a single result (or a single set of results) and distributes it to every
node. The implementation consists of two phases, illustrated by the light and dark arrows in
Figure 13(a), which uses the same incomplete binary tree as the broadcast illustration.
In the first phase, the tree is used to collect the results from the leaves toward the root.
Whenever a node receives the values from its children, it computes the partial result, i.e.,
$val_{node} \oplus val_{lchild} \oplus val_{rchild}$, and sends it to the parent. Therefore, after the root receives the
partial results from its children it can compute the final result. In the second phase, the root
distributes the final result by sending a message to every processor.
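
The first phase can be illustrated with a short sequential sketch. It is not the p4 implementation: it only mimics the order in which partial results are combined on the implicit binary tree (node $i$ has children $2i + 1$ and $2i + 2$), and the function name is hypothetical. A non-empty list of values is assumed.

```python
import operator


def tree_reduce(values, op):
    """Combine one value per node up the implicit binary tree.

    op must be associative and commutative. Returns the value that the
    root (node 0) holds at the end of the first phase.
    """
    partial = list(values)
    for i in reversed(range(len(partial))):      # children before parents
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(partial):
                # val_node (+) val_child, accumulated for both children
                partial[i] = op(partial[i], partial[child])
    return partial[0]


total = tree_reduce(range(1, 12), operator.add)   # 11 processors holding 1..11
largest = tree_reduce([3, 9, 2, 7], max)
```

Because the operation is associative and commutative, the root's value does not depend on the shape of the tree, which is why the same pattern serves for sums, maxima, and minima.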

Since the values carried by the messages are often no larger than 8 bytes (double precision
numbers), we assume that sending and receiving overheads are much larger than the actual
data transmission time and therefore we ignore the message contention on the communication
network. Also, we ignore the time to compute the partial and final results as being much less
than the communication time. From Figure 13(b) it may be seen that the total time required to
complete the global operation is $10a_W + 4a_C + 4a_L$. The observed error between the estimated
completion time and experimental measurements is about 15%. (Since synchronization and
global reduction operate on messages of trivial size, there is no effective hyperbolic law as a
function of message size to graph for these primitives.)

Figure 13: a) The communication pattern used by the synchronization and global operation
primitives for 11 processors. b) The processor activity over time.

3.3  Neighbor  Communication 

A broad range of scientific algorithms arising from differential equations require data to be sent
from one processor to its logical neighbors. As an example, consider a domain decomposition
problem ([11]; see section 5.1) where each subdomain of a domain on which a partial differential
equation is to be solved is mapped onto a single processor. At each iteration of the algorithm
every processor sends to its neighbors the boundary data to be used in the next iteration.

More generally, suppose there are $N$ processors and each of them has $K$ ($K < N$) logical
neighbors. Further, we assume that every processor sends messages of the same length to each
of its neighbors at the same moment of time. The latter assumption is based on a parallel
application model in which every processor has the same amount of work to perform between
sending and receiving boundary data.

By applying the reduction rules 1 and 3 to the resulting CG it is easy to verify that the
equivalent CB for any message path has the following parameters:
$a_{NEIGHBOR} = 2K a_W + KN a_C + a_L$ and $b_{NEIGHBOR} = \max(KN b_C, b_W)$. Figure 14 shows the
estimated $T_c$ function versus experimental measurements with a regression coefficient of 0.92,
and a maximum error of 17%.

4  Experimental  Evaluation  of  CB  parameters 

In principle, one can determine CB parameters by considering a workstation's physical
characteristics (e.g., the processor speed, the memory access time, the internal bus speed, etc.),
and the communication protocol implementation details (e.g., how many times a data buffer is
copied while passed through various protocol layers, the algorithms used to compute the
checksum, etc.). Although this approach apparently allows accurate evaluation of CB parameters,
it is very hard to apply in practice because of several factors:

Figure 14: The estimated $T_c$ versus experimental data for the neighbor communication pattern.
The  regression  coefficient  is  0.92.  Here,  N  =  16  and  K  =  f. 


•  Various  layers  of  the  communication  architecture  are  embedded  in  the  general  purpose 
operating  systems  running  on  the  processing  nodes.  This  makes  them  compete  for  system 
resources  with  other  processes  in  the  multitasking  environment.  It  also  means  that 
various  factors  like  interrupt  processing,  context  switching,  memory  management,  etc., 
combined  with  hardware  features  like  the  presence  of  a  cache  memory  system,  would 
have  to  be  considered  when  trying  to  model  the  communication. 

• Systems may be heterogeneous (made up of machines from different vendors, with
different characteristics and running different operating systems).

• Software packages, such as the support for communication between end processes (at
the application level), each have their own characteristics and introduce their own
overhead, which would have to be represented in a detailed model.


We propose here a simple approach to evaluate the CB parameters with an accuracy whose
acceptability can be judged by its fits in Figures 11, 12 and 14. Let us consider a network
of n identical workstations linked by a communication network $CB_c(a_c, b_c)$. For simplicity,
assume that the overhead for sending and receiving messages is equal. Thus, all workstations
are modeled by the same $CB(a, b)$ irrespective of whether a message is being sent or received.
Now, consider $2n$ workstations numbered from 1 to $2n$, and let each odd-numbered workstation
send a message of the same length to the next even-numbered workstation, i.e., $2i-1$ sends
to $2i$ ($i = 1, 2, \ldots, n$). If we take a pair of workstations $2i-1$ and $2i$ and first apply
rule 3 for $CB_c$ and next rule 1 for $CB_{2i-1}$, $CB_{2i}$ and $CB'_c$, the total service time
required to deliver the message from $2i-1$ to $2i$ is given by:


$$T(x, n; a', b') = \frac{a'^2}{a' + b'x} + b'x \qquad (16)$$
where $a' = 2a + na_c$ and $b' = \max(b, nb_c)$.

There are four parameters to be determined: $a$ and $b$ for the workstation CB, and $a_c$ and
$b_c$ for the network $CB_c$. Theoretically, we can determine all necessary parameters from the
following equations:


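$$
\begin{aligned}
a_c &= \lim_{n\to\infty} \frac{a'}{n} = \lim_{n\to\infty} \frac{\lim_{x\to 0} T(x, n; a', b')}{n} \\
b_c &= \lim_{n\to\infty} \frac{b'}{n} = \lim_{n\to\infty} \frac{\lim_{x\to\infty} T(x, n; a', b')/x}{n} \\
a &= \frac{\lim_{x\to 0} T(x, 1; a', b') - a_c}{2} \\
\max(b, b_c) &= \lim_{x\to\infty} \frac{T(x, 1; a', b')}{x}
\end{aligned}
\qquad (17)
$$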


Notice that the last equation permits determination of $b$, the reciprocal of the bandwidth of
the CB associated with each workstation, only if it is larger than $b_c$. If it is smaller than
$b_c$, it is unnecessary, since the workstation CB is then not the bottleneck in the large
message limit. The first two equations express the well-known truth that when the number of
workstations increases, the network becomes the main bottleneck for the overall performance.
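In practice the limits in (17) can only be approached with a finite number of workstations. The following is a minimal sketch of one way to do so, assuming that small-message times and large-message per-byte slopes have been measured for several values of n. Because a' = 2a + n a_c, a least-squares line through the small-message times against n has slope a_c and intercept 2a; once the network dominates, the large-message slope is n b_c. The function names and the synthetic "measurements" (generated from the parameter values reported in this section) are illustrative assumptions, not the procedure actually used by the authors.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_cb_parameters(small_times, large_slopes):
    """Estimate (a, b, a_c, b_c) from paired-send measurements.

    small_times:  {n: usec for a very small message with n concurrent pairs}
    large_slopes: {n: usec per byte for a very large message with n pairs}
    Both must contain n = 1 and at least one larger n.
    """
    ns = sorted(small_times)
    intercept, a_c = fit_line(ns, [small_times[n] for n in ns])
    a = intercept / 2.0                        # from a' = 2a + n * a_c
    n_max = max(large_slopes)
    b_c = large_slopes[n_max] / n_max          # b' = n * b_c when network-bound
    b_single = large_slopes[1]                 # equals max(b, b_c)
    b = b_single if b_single > b_c else None   # b is not observable if b <= b_c
    return a, b, a_c, b_c


# Synthetic data built from the reported parameter values (usec, usec/byte).
small = {n: 2 * 859.52 + n * 345.60 for n in (1, 2, 4, 8)}
large = {n: max(1.42, n * 0.92) for n in (1, 2, 4, 8)}
a, b, a_c, b_c = estimate_cb_parameters(small, large)
```

Returning `None` for b mirrors the remark above: when the workstation is never the large-message bottleneck, its b cannot be recovered from these measurements.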

For  the  SparcstationELCs  running  SunOS  4.1.3,  the  p4  communication  layer  version  1.3, 
and  the  Ethernet  at  ICASE,  where  the  experiments  were  performed  during  “dedicated”  wee 
hours,  the  parameters  we  obtained  and  used  in  the  “predicted”  curves  in  this  paper  are: 


$a_c$ = 345.60 μsec
$b_c$ = 0.92 μsec/byte
$a$ = 859.52 μsec
$b$ = 1.42 μsec/byte


We note that $1/b_c$ is only about 10% slower than the theoretical peak performance of Ethernet,
virtually the same performance realization reported in [17]. We expect the $b$ parameter of the
workstation to be visible only when there is low contention, since it is within a factor of two
of $b_c$: as soon as two messages share the network, $2b_c = 1.84$ μsec/byte exceeds $b$.
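To make the role of contention concrete, the sketch below evaluates the reduced parameters a' = 2a + n a_c and b' = max(b, n b_c) for the values listed above. The service-time function assumes the hyperbolic form T(x) = a'^2/(a' + b'x) + b'x, which tends to a' for small messages and to b'x for large ones, as (17) requires. That specific form, like the function names, is an assumption of this sketch.

```python
A, B = 859.52, 1.42        # workstation CB: usec, usec/byte
A_C, B_C = 345.60, 0.92    # network CB:     usec, usec/byte


def reduced_parameters(n, a=A, b=B, a_c=A_C, b_c=B_C):
    """Equivalent CB (a', b') for n concurrent sender/receiver pairs."""
    return 2 * a + n * a_c, max(b, n * b_c)


def service_time(x, n):
    """Assumed hyperbolic service time (usec) for an x-byte message."""
    a_p, b_p = reduced_parameters(n)
    return a_p * a_p / (a_p + b_p * x) + b_p * x


for n in (1, 2, 4, 8):
    a_p, b_p = reduced_parameters(n)
    bottleneck = "workstation" if b_p == B else "network"
    print(f"n={n}: a'={a_p:.2f} usec, b'={b_p:.2f} usec/byte ({bottleneck})")
```

With these values the workstation term b sets b' only for n = 1; from n = 2 onward the network term n b_c dominates.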


5  Tests  on  Model  Scientific  Applications 

Two model parallel scientific applications originally written for a tightly coupled multiprocessor
and rewritten in p4 are used as test programs for the hyperbolic model. A domain decomposition
(DD) code for the Poisson problem on the unit square and a multigrid (MG) code for
transient flow in a cavity are chosen among conveniently available codes for their simplicity
and for their very different communication patterns. For each application, we first describe
the algorithm just sufficiently to expose the leading order computational and communication
complexity and to appreciate its general context, then we describe the network parallel
implementation. Fuller descriptions of the applications themselves may be found in references
[12] and [9]. For each application, we select for graphical comparison various communication
cost estimates and corresponding measurements. The estimates derive from appropriate
combinations of the archetypal communication operations described in section 3, with parameters
evaluated as in section 4.

5.1 A Domain Decomposition Application

5.1.1  Algorithm 

The  first  test  problem  is  a  partial  differential  equation  (Poisson’s  equation)  on  a  two-dimensional 
square  domain  with  given  boundary  conditions  and  a  forcing  term  chosen  so  that  the  solution 
is  smooth.  We  assume  uniform  gridding  and  discretize  with  the  standard  five-point  difference 
stencil.  This  generates  a  sparse  banded  system  of  linear  equations.  A  domain  decomposition 
method  using  conjugate  gradient  (CG)  iteration  is  used  to  solve  the  resulting  matrix  equation. 
The  domain  is  divided  into  uniform  square  subdomains  by  the  vertices  of  a  coarse  grid  (nested 
in  the  grid  on  which  the  problem  is  resolved),  and  by  the  edges  connecting  these  vertices. 
Altogether,  three  point  sets  are  distinguished:  the  coarse  grid  vertices  (or  crosspoints),  the 
fine  grid  points  along  the  edges  (or  interfaces),  and  the  fine  grid  interior  points.  The  union 
of the vertices and edges is called the “wire basket.” The decomposition of the physical
domain induces a block structure on the system matrix. The CG iteration is over the full set of
unknowns.

The code used in the tests was originally written for the Intel hypercube by Keyes & Gropp
[12] and uses a BPS-type wire-basket preconditioner [1]. The preconditioning consists of several
serial  stages,  most  of  which  permit  concurrency  with  a  granularity  equal  to  the  number  of 
subdomains.  Independent  problems  on  the  subdomain  interiors  are  solved  concurrently  in  each 
CG  iteration.  The  only  communication  needed  to  set  them  up  is  in  supplying  values  along  the 
four  bounding  segments,  which  may  be  segments  of  the  physical  boundary  or  artificial  interior 
interfaces.  Independent  problems  on  the  interfaces  are  solved  once  per  CG  iteration.  The  only 
communication  required  to  set  them  up  is  in  supplying  forcing  data  along  the  interfaces  from 
adjacent  subdomain  interiors.  The  remaining  stage  is  the  solution  of  a  small  but  global  linear 
system  involving  the  coarse  grid  unknowns;  this  is  the  only  exception  (besides  load  imbalance 
due  to  boundary  effects)  to  full  concurrency  in  the  application  of  the  preconditioner.  Rather 
than solve this problem in a true distributed fashion, which is cost-effective only for small
numbers of processors, the coarse grid problem is assembled in full, redundantly, on each
processor, and then solved serially. Global all-to-all communication is required to set up the
corresponding  right-hand  side,  whose  values  change  at  each  iteration,  and  whose  computation, 
in  turn,  requires  values  from  along  the  four  interfaces  that  meet  at  each  crosspoint.  The 
interior  and  interface  phases  involve  local  communication  whose  overall  volume  scales  with  both 
problem  size  and  granularity.  The  coarse  grid  phase  involves  low-bandwidth  global  broadcasts 
whose  number  (one  for  each  crosspoint)  scales  with  the  granularity,  but  not  with  the  problem 
size.  A  detailed  analysis  of  parallel  implementations  of  methods  of  this  type  on  tightly-coupled 
distributed  memory  machines,  such  as  hypercubes,  meshes,  or  rings,  is  given  in  [7].  Such 
machines  have  multiple  direct  links,  whose  number  scales  with  the  number  of  processors,  so 
that  the  local  interior  and  interface  communications  scale  perfectly. 

5.1.2  Implementation 

On an Ethernet network of workstations, all communication phases compete for a common
resource. The degree to which they collide depends on the volume and on the synchronicity of
the messages. Figure 15 shows the most general communication pattern generated for an
internal subdomain with $n_i$ and $n_j$ points per vertical and horizontal side, respectively.
Obviously, a processor that holds a subdomain located on the physical boundary need not
participate in the complete set of messages shown. The numbers assigned to incoming or outgoing
messages define the order of communication operations, as imposed by the data dependencies in
the algorithm. Each message is labeled with its type and the number of data elements (floating
point values) carried.

One all-to-all set of broadcasts is performed per iteration to distribute the crosspoint system
to all of the p processors. Two additional global reduction operations (not represented in the
figure) are executed at each iteration as part of the CG algorithm itself, independent of the
preconditioning. These operations are the inner products that compute the step lengths
in the vector updates of the CG algorithm. The inner product arithmetic scales as the problem
size,  but  the  message  volume  for  these  global  reductions  scales  with  the  granularity  only,  since 
the  contribution  from  each  processor  is  condensed  to  a  scalar  with  local  operations  before  any 
data  is  shared.  The  global  broadcasts  and  reductions  have  a  “self-synchronizing”  effect  on  the 
parallel  algorithm.  The  main  outcome  of  frequent  synchronization  is  that  most  of  the  measured 
communication  time  is  spent  by  processors  that  finish  their  computations  early  idling  while 
waiting  to  receive  messages  from  those  that  are  delayed. 
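A small sketch shows why the reduction traffic is independent of problem size. The code below simulates p processors, each holding a slice of two vectors; each forms its partial inner product locally and contributes a single scalar to the global sum. It is an illustration in plain Python with hypothetical names, not the p4 code used in the experiments.

```python
def local_dot(u_slice, v_slice):
    """Partial inner product computed entirely on one processor."""
    return sum(ui * vi for ui, vi in zip(u_slice, v_slice))


def global_dot(u_slices, v_slices):
    """Simulated global reduction over p processors.

    Returns the inner product and the number of scalars that would
    have to cross the network (one partial sum per processor).
    """
    partials = [local_dot(u, v) for u, v in zip(u_slices, v_slices)]
    return sum(partials), len(partials)


def split(vec, p):
    """Split vec into p contiguous, nearly equal slices."""
    size, extra = divmod(len(vec), p)
    out, start = [], 0
    for rank in range(p):
        stop = start + size + (1 if rank < extra else 0)
        out.append(vec[start:stop])
        start = stop
    return out


p = 4
for length in (16, 1024):
    u = [1.0] * length
    v = [2.0] * length
    value, scalars_sent = global_dot(split(u, p), split(v, p))
    print(length, value, scalars_sent)
```

The local arithmetic grows with the vector length, while the number of scalars exchanged stays equal to p.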

Three main characteristics of the communication requirements of the algorithm are: the
communication pattern employed is independent of the data, the number of messages
exchanged and the number of global operations per iteration are independent of the iteration
number, and the size of the messages exchanged between neighboring processors is independent
of the number of processors. These characteristics permit very simple analytical models of
parallel complexity and scalability [7], since most aspects of the computation and communication
can be estimated by considering a single processor on a single iteration.

To estimate the computation and communication complexity, consider that the subdomains
are logically square, i.e., $n_i = n_j = n$. Since the algorithm requires that only the points on
the boundary be exchanged between adjacent subdomains, the communication complexity is
$O(n)$ per subdomain. On the other hand, the computation complexity of the algorithm for
each subdomain is $O(n^2)$ for the unpreconditioned CG method (a fixed number of stencil
operations at each point) and $O(n^2 \log n)$ for an FFT-based fast elliptic solver used on each
subdomain. When the computational work increases we expect that any irregularities between
the timings of identical phases of the computation performed on different processors will also
increase with the same power law, i.e., $O(n^2)$ or greater. Thus, as $n$ increases the differences
between the moments when messages sent by different processors physically hit the network
increase much faster ($O(n^2)$) than the message transit times on the network ($O(n)$). Therefore,
we expect that for large $n$ there will be practically no contention for the physical network. The
dominant cause for degradation of efficiency in the large $n$ limit is synchronization. In the
opposite small $n$ limit the messages that are sent are tiny and the actual transmission time on
the network is much smaller than the sending/receiving overheads. Therefore, the dominant
cause for degradation of efficiency in the small $n$ limit is latency.

Figure 15: Communication pattern for a processor associated with an internal subdomain in
the DD method applied to the Poisson problem. A, B, and C are distinct message types, the
compass points indicate message directions, and the number in parentheses is the message size
in real words.

The BPS DD algorithm is run on up to 16 SparcstationELC workstations for the following
subdomain sizes: 16, 32, 64, 128, 256, 512. The largest of these problems corresponds to a
square containing $(4 \times 512)^2 \approx 4.19 \times 10^6$ grid cells and thus to a matrix containing
approximately $2 \times 10^7$ nonzero real entries. As partial differential equation discretizations go,
this is a large problem. Figures 16 and 17 compare predicted and experimental timings for all
16 workstations, in two types of tests. In the first test, the original code is run without any
synchronization beyond that inherited from the algorithm itself. We examine here the sending
time for the total of all messages per iteration, averaged over all processors and all iterations,
as a function of subdomain size. In the absence of a global clock, it is impossible to measure
the actual message transit time, which would be a difference of absolute times on two different
processors. Instead, we use the sending time ($T_s$, in the notation of section 2.5). It was shown
in section 2.5 that for the non-preemptive send primitive, the sending time is equal to the
total communication time minus the transmission and processing time of the last packet of
the message at the receiver. The sending (resp., receiving) time is defined as the difference
of absolute times measured on the same clock immediately before and after posting a send
(resp., receive), blocking, and returning. Figure 16 shows the predicted and measured sending
times, with a maximum error of about 20%. Since many of the messages are small (one real
word), overall dependence on subdomain size, which shows up only in the vector messages
transmitting boundary data, is weak.
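The sending-time measurement described above amounts to bracketing a blocking call with two readings of one local clock. A minimal sketch follows, in which `send` stands for any blocking send routine; it is a hypothetical placeholder, not a p4 function, and `fake_send` merely copies a buffer so that the example runs on its own.

```python
import time


def timed_call(blocking_op, *args):
    """Return (result, elapsed usec) for a blocking operation, using one local clock."""
    start = time.perf_counter()
    result = blocking_op(*args)
    elapsed_usec = (time.perf_counter() - start) * 1e6
    return result, elapsed_usec


def mean_sending_time(send, messages):
    """Average sending time (usec) over a list of messages."""
    total = 0.0
    for msg in messages:
        _, elapsed = timed_call(send, msg)
        total += elapsed
    return total / len(messages)


def fake_send(buf):
    """Stand-in for a real blocking send: it only copies the buffer."""
    return bytes(buf)


avg = mean_sending_time(fake_send, [bytearray(1024)] * 10)
```

Because both clock readings are taken on the sending processor, no global clock is needed; the price is that the result is a sending time rather than a true transit time.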

Next,  we  modify  the  application  by  inserting  additional  synchronization  points  so  that  all 
neighbor  communications  wait  until  all  required  data  is  computed  and  ready,  and  we  measure 
the  time  interval  between  the  synchronization  moment  and  the  moment  when  the  nearest- 
neighbor  vector  messages  (only)  are  received  by  the  application  process,  averaged  over  all 
processors  and  all  iterations,  per  iteration.  Since  the  sender  and  receiver  are  synchronized  for 
the  nearest-neighbor  vector  messages,  the  measured  receiving  time  is  nearly  the  same  as  the 
actual  communication  time.  Figure  17  shows  the  predicted  and  measured  receiving  times  for 
the neighbor communication pattern. In this case the maximum error is about 15%. Since the
one-word messages from inner product computations and coarse grid right-hand side broadcasts
are omitted, a clear (linear in $n$) trend, induced by the domination of the transmission time
over other fixed overheads, is visible.

Figure 16: Estimated sending time ($T_s$) versus experimentally measured sending time for the
native DD code (without explicit synchronization), as a function of the log of the subdomain
size. Note that, unlike Figures 11, 12, and 14, the vertical scale is linear in Figures 16, 17, and
19.

5.2 A Multigrid Application

5.2.1 Algorithm

The second model application is transient simulation of incompressible Navier-Stokes flow in a
two-dimensional square cavity filled with fluid, driven by an oscillatory rigid lid. The numerical
method is based on a standard uniform grid spatial discretization and implicit time
discretization for a velocity-pressure formulation of Navier-Stokes with a hybrid
space-parallel/time-parallel multigrid solver. A multigrid solver uses a succession of grids
representing different refinements of the same problem, in order to iteratively damp the
component of the error at each wavenumber on the grid for which its particular damping factor is
most rapid, rather than damping all error components on the same grid. Space parallelism is
achieved through domain partitioning, with one processor per subdomain, as in the first model
application, though we permit both stripwise and boxwise decomposition in this case, in order to
obtain more flexibility in the number of subdomains, while still preserving the uniformity of
each subdomain. Time parallelism is achieved by assigning identically spatially decomposed time
planes to disjoint sets of processors. The motivation for time parallelism is the degradation of
efficiency in space parallelism that is due both to degrading perimeter-to-area (or
surface-to-volume) ratios of conventional implicit methods, and to degrading convergence rate as
global coupling is sacrificed in the MG “smoother,” which is the error-reducing operation at the
heart of MG, performed on a partition of a grid at a given refinement level. The effectiveness of
time parallelism is counter-intuitive because of causality. Nonetheless, it is more effective
than space parallelism in many practical parameter ranges, when the time resolution of a
transient flow is required.

Figure 17: Estimated receiving time ($T_r$) versus experimentally measured receiving time for
neighbor communication pattern in the artificially synchronized version of the DD code, as a
function of the log of the subdomain size.

In the limit of pure time parallelism, p processors work concurrently on p different time
planes of the transient solution. In the limit of pure space parallelism, only one time plane is
computed at a time. Multigrid is used in the spatial direction only; there is no time
coarsening. (Time coarsening is worthy of attention in other contexts (see, e.g., [10]), but is
irrelevant to our immediate purpose for this second application, namely to introduce a
communication complexity that scales to the same asymptotic order in problem size as the
computation complexity.)

A multigrid solver is defined by a grid coarsening strategy, a cycle for visiting successively
coarsened grids, a smoother designed to reduce the highest wavenumber errors on a grid of
a given level, and intergrid transfer operators to map the solution or its residual between
grids at adjacent levels. As with the DD application, it is beyond the scope of this paper
to provide a self-contained specification of the MG algorithm. It should suffice to specify
for cognoscenti that: the spatial coarsening is by powers of two in a simple V-cycle scheme,
the semi-implicit method for pressure-linked equations (SIMPLE) defines the linearization,
incomplete LU (ILU) decomposition serves as the smoother, and standard full weighting is used
for intergrid transfers. The space parallelism enters through the elimination of certain
off-diagonal blocks of the ILU factorization. The code was originally written for the Intel
Hypercube by Horton [9].

The communication patterns and the amount of traffic vary with the allocation of available
processors between space and time, as well as with the refinement of the spatial grid, with
the result that in this second application a wide range of message sizes, message numbers
and message patterns are observed, depending on three factors: the number of physical time
steps simultaneously solved for, the number of domain partitions, and the number of spatial
coarsening levels. The most important observation about the computation and communication
complexity, however, is that their asymptotic sizes are of equal order. Consider the purely time
parallel limit of p planes of $n \times n$ gridpoints. Transferring the full plane of data between
time levels is an $O(n^2)$ operation, which is the same as the $O(n^2)$ arithmetic complexity of the
stencil operations of residual evaluation and ILU smoothing in the fine grid sweep of the MG
algorithm.

5.2.2  Implementation 

In Figure 18, we show the main patterns of communication generated between processors
assigned to different time steps (“in time”) and between processors assigned to the same time
step (“in space”). Here $p_x$ is the number of grid partitions ($p_x$ processors are assigned to
solving the problem for every time step), while $p_t$ is the number of consecutive time steps
($p_t$ is the number of groups of $p_x$ processors, each group operating on a different time step).
Global operations are not shown in Figure 18; in general, their number is not constant,
depending on the number of grid levels (kept fixed) and on the factorization of $p$ into
$p_t \times p_x$. The large messages are those carrying grid information (labeled G in the figure)
between processors assigned to different time steps. The size of these messages decreases as
$p_x$ increases, for a constant number of time steps $p_t$, as the individual space domains are
partitioned over more processors. For a given number of available processors, the size of the
messages increases with $p_t$, as the space grid partitions become larger. The messages labeled
“IR” and “IL” carry vectors of edge data right and left across spatial processor boundaries in
the stripwise decomposition shown.


Figure 18: Principal communication patterns for a single time step in the MG code, featuring
both time ($p_t = 3$) and strip-based space ($p_x = 4$) parallelism for $p = 12$ processors.

Figure 19 shows experimental versus predicted timings, for $p_x = 1$ and $p_t$ between 2 and
12. There are two predicted curves: one for when all the communication operations are
synchronized (and therefore the contention on the communication network is maximum), and
one for the idealized case of no contention. For more than three workstation processors, the
network (as opposed to the processing overhead) is the communication bottleneck and thus the
communication time increases linearly with the number of processors. On the other hand, for
two processors the processing overhead represents the actual communication bottleneck and
therefore the communication time does not increase at the same rate between two and three
processors as in the other cases. This effect was anticipated at the end of section 4.
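The change of slope between two and three processors follows directly from b' = max(b, n b_c). The sketch below assumes, for illustration only, that with $p_x = 1$ and $p_t$ time planes there are $p_t - 1$ plane-to-plane transfers, all concurrent in the fully synchronized case and never overlapping in the zero-contention case. It uses the parameter values of section 4; the function name is hypothetical.

```python
B, B_C = 1.42, 0.92   # usec/byte: workstation and network CBs (section 4)


def large_message_rate(p_t, synchronized=True):
    """Asymptotic cost per byte (usec/byte) for plane-to-plane transfers.

    With p_t time planes, p_t - 1 planes send concurrently when all
    processors are synchronized; with no overlap each transfer sees
    the network alone.
    """
    senders = p_t - 1 if synchronized else 1
    return max(B, senders * B_C)


for p_t in range(2, 7):
    print(p_t, large_message_rate(p_t), large_message_rate(p_t, synchronized=False))
```

Under these assumptions the synchronized rate rises by 0.42 usec/byte between two and three processors, where the workstation is still the bottleneck at $p_t = 2$, and by 0.92 usec/byte for each additional processor thereafter.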

The measured communication time is bounded by the limits of the zero and maximal
contention predicted curves. The difference between the maximal contention prediction and the
actual communication time is due primarily to the lack of synchronization. The estimated
communication times assume that all like messages are sent synchronously by all processors.
In practice, the processors do not finish the computation phase at the same time and therefore
the message sending is initiated at slightly staggered moments. This is due to slight
workload imbalances and to nondeterministic factors that arise even when identical workstations
have the same amount of work to perform. The computation time is influenced by the cache
memory system, interrupt service, task switching and page swapping beyond the control of the
application.

As in the DD examples, we modify the application so that, before sending, all processors
are synchronized. The results obtained are also plotted against the predictions in Figure 19.
In the synchronous case, the measured data are very close to the predictions (within 17%).
Moreover, the difference should be even smaller if we could measure the real communication
times ($T_c$) and not just the sending times ($T_s$). In conclusion, the difference between the
predicted communication time and the actual results expresses in some way the degree of the
application synchronism. When the measured results are close to the synchronous predictions,
the processors send messages at almost the same moments in time, which results in greater
contention on the communication network and larger communication time.

In  Figure  20  we  consider  different  numbers  of  processors  both  in  time  and  space.  Since 
the  main  data  traffic  occurs  between  consecutive  planes,  we  do  not  consider  the  processors  in 
the  last  plane  that  only  receive  data.  Between  processors  in  the  same  plane  a  large  number  of 
small  messages  (several  hundreds)  are  exchanged.  This  enforces  a  “natural”  synchronization 
and,  therefore,  the  time  differences  between  the  synchronized  and  non-synchronized  (original) 
version of the application are smaller. This can be observed, especially, when processors in
only one plane have to send data, i.e., $p_t = 2$. On the other hand, for $p_t = 4$ there are 3 planes
that concurrently send data in time: plane 1 to plane 2, plane 2 to plane 3 and plane 3 to
plane 4. Since the messages exchanged in the same plane synchronize only with processors
in  that  plane,  the  processors  in  different  planes  are  not  so  tightly  synchronized  and  therefore 
the  differences  between  the  synchronized  and  non-synchronized  version  of  the  application  are 
larger. 

5.3  Discussion 

Each  of  the  two  applications  above  gives  rise  to  a  small  set  of  communication  subprograms, 
such  as  global  reduction  or  exchange  of  surface  (resp.,  volume)  data  between  spatially  (resp., 
temporally)  neighboring  processors.  These  subprograms  are  called  with  message  sizes  ranging 
from one word to the order of the number of words of data in the problem. The hyperbolic
model performs well for each communication subprogram class. It even has some value (see
Figure 16) in predicting measurements averaged over all of the different communication
subprograms in the algorithmically correct proportions.

Figure 19: Predicted sending times ($T_s$) for the two extreme cases of zero contention on the
communication network (no communication overlap) and maximum contention on the communication
network (full communication overlap) versus the measured sending times for both the native
MG code and the artificially synchronized version of the code. These results are for maximal
time parallelism ($p_x = 1$ and $p_t = p$), which leads to the largest average message size.

Figure 20: Predicted and measured sending times (in microseconds) for one multigrid V-cycle,
for varying degrees of time and space parallelism, using either 4 or 8 processors.

One of the applications (time-parallel multigrid) is limited by network contention, while
the other application (domain decomposition) is limited only by irregularities in computation
time and frequent synchronization. Both limitations are serious as regards scalability,
particularly in cluster computing environments without dedicated nodes. Future algorithmic design
should be heavily influenced by such communication analyses. In particular, the inner product
operations used to drive the conjugate gradient iterations are burdensome, and
their synchronization cost should be reduced by algorithmic variants that block several
consecutive iterations into one set of global reduction operations. Indeed, the synchronization
costs of conjugate-gradient-type methods may lead to a resurgence of interest in Chebyshev-like
methods.

The modifications made to the original application programs to produce artificial
synchrony are for purposes of demonstrating the ability of the hyperbolic model to predict
contention only, and are not recommended in production versions.

Tests on architectures other than Ethernet Sparcstation clusters, with message-passing
protocols other than p4, using applications other than domain decomposition and time-parallel
multigrid are planned, to further define the range of applicability of the hyperbolic model.

6  The  LogP  Model 

Recently, a new model of parallel computational complexity for massively parallel processors,
called LogP, has been developed at Berkeley [4]. The underlying architecture consists of
modules connected by a communication network. A module contains a processor, a local memory
and a network interface. The model assumes that send and receive operations are performed by
the main processor, i.e., there is no specialized processor to perform network interface
functions. This means that during the send or receive operations the processor does not perform
any other computation. The basic version of the LogP model assumes that all messages have
the same size and that this size is small. The model is characterized by four main parameters:

1.  L  -  the  upper  bound  for  the  delay  of  a  message  transmission  between  the  source  and 
destination  processors. 

2. o - the overhead, i.e. the time interval required to send or receive a message. During this
time the processor cannot perform any other operations.

3.  g  -  the  gap,  defined  as  the  minimum  time  interval  between  two  consecutive  message 
transmissions  or  receptions. 

4.  P  -  the  number  of  modules. 

When a small message is sent, according to the LogP model, the communication time is
equal to the sending overhead o, plus the delay time L, plus the receiving overhead o, i.e.
2o + L. On the other hand, when more than one message is sent by the same processor, a new
message cannot be issued earlier than max(g, o) after the previous one and, therefore, the
communication time to send n consecutive messages is (n - 1) max(g, o) + 2o + L. The first
term accounts for sending the first n - 1 messages, while the last two account for sending the
last message (see Figure 21). Since for n = 1 the second expression reduces to the first, we
consider further only the second one.

Figure 21: The time diagram for sending 4 consecutive messages, from P0 to P1, in the LogP
model. Here, g > o.
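
The LogP timing expression described above, (n - 1) max(g, o) + 2o + L, can be written as a small Python sketch. The function name and the numerical values are illustrative assumptions, not taken from the measurements in this paper.

```python
def logp_time(n, L, o, g):
    """Communication time for n consecutive small messages in basic LogP.

    L: upper bound on network delay, o: send/receive overhead,
    g: gap between consecutive transmissions (all in the same time unit).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The first n - 1 messages are spaced max(g, o) apart; the last one
    # costs a sending overhead, the network delay, and a receiving overhead.
    return (n - 1) * max(g, o) + 2 * o + L


# Hypothetical parameter values in arbitrary time units, with g > o
print(logp_time(1, L=6, o=2, g=4))  # 10, i.e. 2o + L
print(logp_time(4, L=6, o=2, g=4))  # 22, the situation drawn in Figure 21
```

For n = 1 the function returns 2o + L, which is the single-message case.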


To capture the LogP parameters in the hyperbolic model we use the communication graph
from Figure 9, where $a_w = b_w = o$, $a_L = L$, $b_L = 0$ and $a_c = 0$, $b_c = g$. Since the LogP
model assumes that all messages are of a small fixed size, these will be interpreted as packets
in the hyperbolic model, while consecutive messages sent by one processor (in LogP) will be
interpreted as packets of a single message. Also, because all packets are of the same size, we
take the size of the data unit and the packet size to be the same (i.e. each packet contains
exactly one data unit). By applying rule 1 to the communication graph, it is easy to see that the
equivalent communication block has the following parameters: $a = a_w + a_c + a_L + a_w = 2o + L$
and $b = \max\{b_w, b_c, b_L\} = \max(g, o)$. The total service time is given by (15).
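
The reduction used in this paragraph (the a parameters add and the b parameters combine by a maximum) can be sketched in Python as follows. The names `Block`, `reduce_serial`, and `logp_as_block` are hypothetical helpers introduced only for illustration, and the sketch covers only the chain of blocks named in the text, not the general reduction rules.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """A communication block with its two hyperbolic-model parameters."""
    a: float
    b: float


def reduce_serial(blocks):
    """Combine a chain of blocks: a values are summed, b is the maximum."""
    return Block(a=sum(blk.a for blk in blocks),
                 b=max(blk.b for blk in blocks))


def logp_as_block(L, o, g):
    """Equivalent block for the LogP mapping a_w = b_w = o, a_L = L,
    b_L = 0, a_c = 0, b_c = g described in the text."""
    block_w = Block(o, o)      # appears twice in the chain: a_w + ... + a_w
    block_c = Block(0.0, g)
    block_L = Block(L, 0.0)
    return reduce_serial([block_w, block_c, block_L, block_w])


# Hypothetical values: expect a = 2o + L = 10 and b = max(g, o) = 4
print(logp_as_block(L=6, o=2, g=4))
```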

When we write $x \to 0$ in the hyperbolic model, we are referring to the smallest possible
size of a message that can be sent, which can generally be much smaller than the packet size.
However, in this case, a packet consists of exactly one data unit (corresponding to a message in
LogP) and therefore a message cannot be smaller than a packet. To accommodate
this restriction within the formalism of the hyperbolic model, we take $x = n - 1$, where n is
the size of the message. Thus, $x \to 0$ in the hyperbolic model and n = 1 in the LogP model
refer to the identical limit, namely that of the smallest message that can be sent. Next, if we
denote by $T_{hyp}(n)$ ($= T(n - 1; a, b)$) the communication time to send a message of size n in the
hyperbolic model and by $T_{LogP}(n)$ the communication time to send n consecutive messages in
the LogP model, we obtain:

\[
T_{hyp}(n) = \frac{a^2}{a + (n-1)b} + (n-1)b, \qquad T_{LogP}(n) = a + (n-1)b.
\]

To see how much the estimated communication times of the two models may differ, we consider
the ratio $T_{hyp}(n)/T_{LogP}(n)$:

\[
\frac{T_{hyp}(n)}{T_{LogP}(n)} = \frac{a^2 + (n-1)ab + (n-1)^2 b^2}{a^2 + 2(n-1)ab + (n-1)^2 b^2}.
\]


It is easy to verify that for any value of $n \ge 1$ and nonnegative a and b, we have:

\[
\frac{3}{4} \le \frac{T_{hyp}(n)}{T_{LogP}(n)} \le 1. \qquad (18)
\]
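
One way to verify the bound (18) is the following short derivation. It assumes a > 0 and that the ratio has the form $(a^2 + (n-1)ab + (n-1)^2 b^2)/(a + (n-1)b)^2$; the variable y is introduced here only for this sketch.

```latex
% Substitute y = (n-1) b / a >= 0 and divide numerator and denominator by a^2.
\[
\frac{T_{hyp}(n)}{T_{LogP}(n)}
  = \frac{1 + y + y^2}{(1 + y)^2}
  = 1 - \frac{y}{(1 + y)^2}.
\]
% Since (1 + y)^2 - 4y = (1 - y)^2 >= 0, the subtracted term is at most 1/4.
\[
0 \le \frac{y}{(1 + y)^2} \le \frac{1}{4}
\quad\Longrightarrow\quad
\frac{3}{4} \le \frac{T_{hyp}(n)}{T_{LogP}(n)} \le 1.
\]
```

The lower bound is attained at y = 1, that is, when (n - 1)b = a, and the upper bound at y = 0, that is, when n = 1 or b = 0.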


Further, let us compute the sending (resp., receiving) time, i.e. the actual time required
by a processor to send (resp., receive) n consecutive messages, for both models. For the
LogP model, clearly, we have (see Figure 21) $T_{s\_LogP}(n) = T_{r\_LogP}(n) = no$. Next, notice
that if g > o, after a message is sent, the processor is free for time g - o to perform other
computations. Since we have interpreted consecutive messages sent by the same processor
in the LogP model as packets of a single message in the hyperbolic model, between any two
consecutive packets sent or received in the hyperbolic model, the processor can perform other
computations. Therefore, the equivalent sending and receiving primitives of the hyperbolic
model are preemptive. From Figure 8 we thus have:

\[
T_{s\_hyp}(n) = T_{r\_hyp}(n) = \frac{o^2}{o + (n-1)o} + (n-1)o = \left(\frac{1}{n} + n - 1\right) o.
\]

To further quantify the difference between the sending/receiving times estimated
by the two models, we form $T_{s\_hyp}/T_{s\_LogP}$ (resp., $T_{r\_hyp}/T_{r\_LogP}$):

\[
\frac{T_{s\_hyp}(n)}{T_{s\_LogP}(n)} = \frac{1}{n^2} - \frac{1}{n} + 1,
\]

which gives us the following bounds for $n \ge 1$:

\[
\frac{3}{4} \le \frac{T_{s\_hyp}(n)}{T_{s\_LogP}(n)} \le 1. \qquad (19)
\]
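
The bounds in (19) can also be checked numerically. The sketch below assumes that the hyperbolic service time has the form T(x; a, b) = a^2/(a + bx) + bx, which is the form consistent with the ratios in this section, and that the sending block has a = b = o. The function names and the value of o are illustrative assumptions.

```python
def hyperbolic_time(x, a, b):
    """Assumed hyperbolic service time T(x; a, b) = a^2 / (a + b x) + b x."""
    if a == 0 and b * x == 0:
        return 0.0  # avoid 0/0 in the degenerate case
    return a * a / (a + b * x) + b * x


def send_time_hyp(n, o):
    """Sending time for n packets in the hyperbolic model, with a = b = o."""
    return hyperbolic_time(n - 1, o, o)


def send_time_logp(n, o):
    """Sending time for n consecutive messages in LogP: n overheads."""
    return n * o


o = 2.0  # hypothetical overhead, arbitrary time units
ratios = [send_time_hyp(n, o) / send_time_logp(n, o) for n in range(1, 51)]
print(min(ratios), max(ratios))  # 0.75 1.0
```

The minimum 3/4 occurs at n = 2 and the maximum 1 at n = 1; the ratio approaches 1 again as n grows.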


7  Conclusions 

A two-parameter hyperbolic model for parallel communication complexity on general dedicated
networks has been proposed and validated by experiments with test programs containing
communication patterns frequently encountered in scientific computations. Because of the way its
parameters are fit to experiments, the model captures both small-message and large-message
timing behavior well. The quality of agreement between model and measurement at intermediate
message sizes suggests that two parameters are adequate. Each communication pattern, in
principle, requires its own set of parameters. The practical utility of the model in unstructured
computations may therefore be limited. Fortunately, many scientific computations calling for
parallel supercomputing rely on a small number of structured communication patterns, so the
hyperbolic model is tractable.

In the limit of small uniform messages, which affords direct comparison with the
state-of-the-art LogP model appropriate for tightly coupled architectures, the hyperbolic and
LogP models agree closely: for elementary communication operations, the hyperbolic prediction
lies between 3/4 and 1 times the LogP prediction.

The model can be used to provide insight into communication performance of actual
distributed scientific applications. A domain decomposition code for solving elliptic PDEs and a
time-parallel multigrid method for transient simulation of Navier-Stokes cavity flow are chosen
for demonstration purposes, because of their different synchronization/communication ratios
and complementary communication patterns. Complementary bottlenecks to scalability are
thus identified. Realistic analyses of communication such as these can be used to influence
algorithmic design, for a given architecture, and vice versa.